What will be the folder structure created when writing a DataFrame partitioned by two columns?
Consider this Spark code:
df.write.partitionBy("year", "month").parquet("/data/events")
The DataFrame has data for years 2022 and 2023, and months 1 and 2.
Which folder paths will be created inside /data/events?
Partition folders are created in the order of the columns specified in partitionBy.
When writing with partitionBy("year", "month"), Spark creates nested folders first by year, then by month inside each year folder.
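The resulting layout can be sketched in plain Python. This is an illustration of Spark's Hive-style key=value folder naming, not actual Spark output; the paths are derived from the year and month values in the question:

```python
from itertools import product

# Spark names partition folders Hive-style (<column>=<value>) and nests
# them in the order the columns were given to partitionBy("year", "month").
years = [2022, 2023]
months = [1, 2]

expected_folders = [
    f"/data/events/year={y}/month={m}" for y, m in product(years, months)
]
for path in expected_folders:
    print(path)
# /data/events/year=2022/month=1
# /data/events/year=2022/month=2
# /data/events/year=2023/month=1
# /data/events/year=2023/month=2
```

Note that year is always the outer folder and month the inner one, because that is the order passed to partitionBy.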
You write a DataFrame using partitionBy("category"). The DataFrame has 2 unique categories: 'A' and 'B'.
If the DataFrame has 6 partitions before writing, how many parquet files will be created inside the output folders?
Each DataFrame partition writes one file into each partition folder for which it holds rows.
Even though there are only 2 categories, the DataFrame has 6 partitions. If each partition holds rows for a single category, each writes exactly one file, so total files = 6; a partition containing rows for both categories would write two files, giving up to 12 in total.
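The counting rule above can be sketched in plain Python. The assignment of categories to partitions is hypothetical; what matters is that each task writes one file per category value actually present in its data:

```python
# Hypothetical case: each of the 6 DataFrame partitions happens to
# contain rows for exactly one category.
partitions = [{"A"}, {"B"}, {"A"}, {"B"}, {"A"}, {"B"}]

# One output file per (task partition, category) pair that is present.
files_written = sum(len(categories) for categories in partitions)
print(files_written)  # 6

# If every partition contained rows for both categories, the count doubles:
mixed = [{"A", "B"}] * 6
print(sum(len(categories) for categories in mixed))  # 12
```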
What error will this Spark code produce?
df.write.partitionBy("nonexistent_column").parquet("/output/path")
The DataFrame df does not have a column named nonexistent_column.
Partition columns must exist in the DataFrame schema.
Spark throws an AnalysisException if the partition column is missing from the DataFrame.
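The validation Spark performs can be mimicked with a plain-Python check. This is an illustrative stand-in, not Spark's implementation; it raises ValueError where Spark would raise AnalysisException:

```python
def check_partition_columns(schema_columns, partition_columns):
    """Mimic Spark's check that every partitionBy column exists in the schema."""
    missing = [c for c in partition_columns if c not in schema_columns]
    if missing:
        # Spark raises AnalysisException here; ValueError is a stand-in.
        raise ValueError(f"Partition column(s) not found in schema: {missing}")

schema = ["year", "month", "event"]      # hypothetical DataFrame schema
check_partition_columns(schema, ["year"])  # valid column: no error
try:
    check_partition_columns(schema, ["nonexistent_column"])
except ValueError as e:
    print(e)
```

The check happens during query analysis, before any data is written, which is why no partial output appears under /output/path.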
You have a large sales dataset with columns: date, region, product, and sales_amount.
You want to write the data partitioned to speed up queries filtering by region and date. Which partitioning strategy is best?
Partition columns should match common filter columns in queries.
Partitioning by both region and date lets Spark prune partitions: queries filtering on these columns read only the matching folders instead of scanning the whole dataset.
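Partition pruning can be illustrated in plain Python: with Hive-style folders, a filter on the partition columns selects only the matching directories. The folder paths below are hypothetical examples, not a real listing:

```python
# Hypothetical folders produced by partitionBy("region", "date").
folders = [
    "/data/sales/region=EU/date=2023-01-01",
    "/data/sales/region=EU/date=2023-01-02",
    "/data/sales/region=US/date=2023-01-01",
    "/data/sales/region=US/date=2023-01-02",
]

# A query like WHERE region = 'EU' AND date = '2023-01-01'
# only needs to read one folder; the rest are skipped entirely.
pruned = [f for f in folders if "region=EU" in f and "date=2023-01-01" in f]
print(pruned)  # ['/data/sales/region=EU/date=2023-01-01']
```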
Which statement best describes the impact of partitioning output data on data skew and shuffle during a Spark write?
Think about how data distribution affects partitioning.
Partitioning can reduce shuffle by grouping related data together, but if the partition keys are unevenly distributed, some partitions grow much larger than others, causing skew.
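Skew can be made visible by comparing partition sizes. A short sketch with hypothetical row counts per key, where one key dominates the data:

```python
# Hypothetical row counts per partition key: "US" dominates.
rows_per_key = {"US": 9_000_000, "EU": 500_000, "APAC": 500_000}

total = sum(rows_per_key.values())
largest = max(rows_per_key.values())
average = total / len(rows_per_key)

# A ratio well above 1 means one partition does most of the write work
# while the others sit idle.
skew_ratio = largest / average
print(f"skew ratio: {skew_ratio:.1f}")  # skew ratio: 2.7
```

In a real job, a similar diagnosis comes from df.groupBy(partition_column).count() before writing.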