
Writing output with partitioning in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of partitioned write with multiple columns

What will be the folder structure created when writing a DataFrame partitioned by two columns?

Consider this Spark code:

df.write.partitionBy("year", "month").parquet("/data/events")

The DataFrame has data for years 2022 and 2023, and months 1 and 2.

Which folder paths will be created inside /data/events?

A) /data/events/year=2022/month=1/, /data/events/year=2022/month=2/, /data/events/year=2023/month=1/, /data/events/year=2023/month=2/
B) /data/events/month=1/year=2022/, /data/events/month=2/year=2022/, /data/events/month=1/year=2023/, /data/events/month=2/year=2023/
C) /data/events/year=2022/, /data/events/year=2023/, /data/events/month=1/, /data/events/month=2/
D) /data/events/year=2022_month=1/, /data/events/year=2022_month=2/, /data/events/year=2023_month=1/, /data/events/year=2023_month=2/
💡 Hint

Partition folders are created in the order of the columns specified in partitionBy.
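The nesting order can be reasoned about without running Spark: the folder layout is the Cartesian product of the partition values, nested in the order the columns are passed to partitionBy. A minimal sketch (the helper partition_paths is hypothetical, not a Spark API):

```python
from itertools import product

def partition_paths(base, columns, values):
    """Build Hive-style partition directories: one path per combination
    of partition values, nested in the order the columns are listed."""
    paths = []
    for combo in product(*(values[c] for c in columns)):
        parts = [f"{c}={v}" for c, v in zip(columns, combo)]
        paths.append(base.rstrip("/") + "/" + "/".join(parts) + "/")
    return paths

# Column order in partitionBy determines nesting: year first, then month.
paths = partition_paths("/data/events", ["year", "month"],
                        {"year": [2022, 2023], "month": [1, 2]})
for p in paths:
    print(p)
```

Swapping the column order to ["month", "year"] would produce month=…/year=… directories instead, which is why option B differs from option A only in nesting order.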

Data Output (intermediate)
Number of files created with partitioning

You write a DataFrame with 3 partitions using partitionBy("category"). The DataFrame has 2 unique categories: 'A' and 'B'.

If each of the 3 partitions contains rows for both categories, how many parquet files will be created across the output folders?

A) 2 files
B) 6 files
C) 3 files
D) 1 file
💡 Hint

Each DataFrame partition writes one file into every partition folder for which it holds rows.
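The file count can be sketched with a toy in-memory model, assuming three DataFrame partitions that each contain rows for both categories (the function name and data model are illustrative, not Spark APIs):

```python
def count_output_files(partitions, partition_col):
    """Count parquet files: each DataFrame partition (task) writes one
    file into each partition-value folder it holds rows for."""
    files = 0
    for rows in partitions:
        distinct_values = {row[partition_col] for row in rows}
        files += len(distinct_values)
    return files

# 3 DataFrame partitions, each holding rows for both categories 'A' and 'B'
partitions = [
    [{"category": "A"}, {"category": "B"}],
    [{"category": "A"}, {"category": "B"}],
    [{"category": "A"}, {"category": "B"}],
]
print(count_output_files(partitions, "category"))  # 3 tasks x 2 categories = 6 files
```

If a partition held rows for only one category, it would contribute only one file, which is why the distribution of keys across partitions matters.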

🔧 Debug (advanced)
Error when writing partitioned data with unsupported column

What error will this Spark code produce?

df.write.partitionBy("nonexistent_column").parquet("/output/path")

The DataFrame df does not have a column named nonexistent_column.

A) AnalysisException: 'Partition column nonexistent_column not found in schema'
B) NullPointerException
C) No error, writes data ignoring the partition column
D) FileNotFoundException
💡 Hint

Partition columns must exist in the DataFrame schema.
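Spark performs this check at analysis time, before any files are written. A rough plain-Python analogue of that validation (the function is a hypothetical stand-in, not Spark's implementation):

```python
def validate_partition_columns(schema, partition_cols):
    """Mimic Spark's analysis-time check: every partitionBy column
    must exist in the DataFrame schema, or the write fails up front."""
    missing = [c for c in partition_cols if c not in schema]
    if missing:
        raise ValueError(
            f"Partition column {missing[0]} not found in schema {schema}"
        )

schema = ["year", "month", "value"]
validate_partition_columns(schema, ["year"])  # passes silently
try:
    validate_partition_columns(schema, ["nonexistent_column"])
except ValueError as e:
    print(e)
```

Because the check happens up front, no partial output directory is left behind when the column is missing.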

🚀 Application (advanced)
Choosing partition columns for efficient queries

You have a large sales dataset with columns: date, region, product, and sales_amount.

You want to write the data partitioned to speed up queries filtering by region and date. Which partitioning strategy is best?

A) Partition by product
B) Partition by date only
C) Partition by region only
D) Partition by region and date (in that order)
💡 Hint

Partition columns should match common filter columns in queries.
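The payoff of matching partition columns to filter columns is partition pruning: Spark can skip entire directory trees without reading any data. A minimal sketch of that pruning logic over a region/date layout (the prune helper and paths are illustrative assumptions):

```python
def prune(paths, **filters):
    """Keep only partition directories whose key=value segments match
    the filters; everything else is skipped without reading any data."""
    def matches(path):
        segments = dict(s.split("=") for s in path.strip("/").split("/") if "=" in s)
        return all(segments.get(k) == v for k, v in filters.items())
    return [p for p in paths if matches(p)]

paths = [
    "/sales/region=EU/date=2023-01-01/",
    "/sales/region=EU/date=2023-01-02/",
    "/sales/region=US/date=2023-01-01/",
    "/sales/region=US/date=2023-01-02/",
]
print(prune(paths, region="EU"))                     # both EU directories
print(prune(paths, region="EU", date="2023-01-02"))  # a single directory
```

A query filtering on both columns narrows the scan to one directory, while partitioning by product would not help these filters at all.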

🧠 Conceptual (expert)
Effect of partitioning on data skew and shuffle

Which statement best describes the impact of partitioning output data on data skew and shuffle during a Spark write?

A) Partitioning always reduces shuffle and eliminates data skew
B) Partitioning increases shuffle and always causes data skew
C) Partitioning can reduce shuffle but may cause data skew if partition keys are unevenly distributed
D) Partitioning has no effect on shuffle or data skew
💡 Hint

Think about how data distribution affects partitioning.
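Skew is easy to see by counting rows per partition key. A small sketch with made-up, deliberately uneven region counts (the numbers are illustrative, not from any real dataset):

```python
from collections import Counter

# Partitioning by a skewed key: one value dominates, so its output
# folder (and the task writing it) handles far more rows than the rest.
rows = ["US"] * 9000 + ["EU"] * 800 + ["APAC"] * 200
per_partition = Counter(rows)
print(per_partition)  # heavily uneven: US holds 90% of the rows

largest = max(per_partition.values())
total = sum(per_partition.values())
print(f"largest partition holds {largest / total:.0%} of all rows")
```

Here one partition folder receives 90% of the data, so the task writing it becomes a straggler even though partition pruning still benefits later reads.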