Challenge - 5 Problems
Shuffle Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of join without shuffle
Consider two Spark DataFrames df1 and df2, both partitioned by the same column 'id'. What will be the output count of the following join operation?
Apache Spark
df1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'val1']).repartition('id')
df2 = spark.createDataFrame([(1, 'x'), (2, 'y'), (4, 'z')], ['id', 'val2']).repartition('id')
joined = df1.join(df2, 'id')
result_count = joined.count()
print(result_count)
💡 Hint
Think about which ids are common in both DataFrames.
Explanation
The join matches rows with the same 'id'. Only ids 1 and 2 are common, so the result has 2 rows.
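The count can be sanity-checked without a Spark cluster. Below is a plain-Python sketch of inner-join semantics on the same data — a model for illustration, not Spark itself; the variable names mirror the snippet above.

```python
# Plain-Python model of an inner join on 'id' (stand-in for the Spark join).
df1 = [(1, 'a'), (2, 'b'), (3, 'c')]   # (id, val1)
df2 = [(1, 'x'), (2, 'y'), (4, 'z')]   # (id, val2)

# Index df2 by id, then match df1 rows against it.
df2_by_id = {id_: val2 for id_, val2 in df2}
joined = [(id_, val1, df2_by_id[id_]) for id_, val1 in df1 if id_ in df2_by_id]

print(joined)       # → [(1, 'a', 'x'), (2, 'b', 'y')]
print(len(joined))  # → 2
```

Only ids 1 and 2 exist on both sides, so the inner join produces 2 rows.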
🧠 Conceptual
Intermediate · 1:30 remaining
Why avoid shuffle in Spark?
Why is it important to avoid shuffle operations in Apache Spark when possible?
💡 Hint
Think about what happens behind the scenes during a shuffle.
Explanation
A shuffle redistributes data across the cluster: records are serialized, written to local shuffle files on disk, and transferred over the network to the executors that need them. All of this is far slower than computation that stays within a partition.
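To make "moves data across the cluster" concrete, here is a toy plain-Python model — an illustrative assumption, not Spark's actual implementation — that counts how many records would have to leave their current partition when the data is re-hashed by key:

```python
# Toy model: records start spread over 3 partitions; a shuffle routes each
# record to the partition that owns its key (hash(key) % num_partitions).
num_partitions = 4

# (key, value, current_partition) records; current layout is key % 3.
records = [(k, f"v{k}", k % 3) for k in range(12)]

def target_partition(key):
    # Hash partitioner, as a shuffle would apply. (hash() is the identity
    # for small non-negative ints in CPython.)
    return hash(key) % num_partitions

moved = sum(1 for key, _, current in records if target_partition(key) != current)
print(f"{moved} of {len(records)} records cross partitions")  # → 9 of 12
```

Most records end up on a different partition than where they started — each of those crossings is disk and network traffic in a real cluster.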
🔧 Debug
Advanced · 2:00 remaining
Identify shuffle causing operation
Given the following Spark code, which line causes a shuffle operation?
Apache Spark
df = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'val'])
df2 = df.filter(df.id > 1)
df3 = df2.groupBy('val').count()
df4 = df3.orderBy('count')
df4.show()
💡 Hint
GroupBy operations usually cause shuffle.
Explanation
groupBy triggers a shuffle so that all records with the same key land in the same partition before counting. (Note that orderBy also triggers a shuffle for the global sort, so this snippet actually shuffles twice; filter is a narrow transformation and does not.)
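A rough plain-Python walkthrough of the same pipeline — illustrative only, not Spark — shows why filter is "narrow" while groupBy is "wide": the filter touches each partition independently, but counting by 'val' needs every record for a given key in one place.

```python
from collections import Counter

# Toy 2-partition layout of the (id, val) rows from the snippet.
partitions = [[(1, 'a')], [(2, 'b'), (3, 'c')]]

# Narrow: filter(id > 1) runs per partition; no data crosses boundaries.
filtered = [[(i, v) for i, v in part if i > 1] for part in partitions]

# Wide: groupBy('val').count() must gather all records per key together —
# this flattening across partitions is what the shuffle does on a cluster.
counts = Counter(v for part in filtered for _, v in part)
print(dict(counts))  # → {'b': 1, 'c': 1}
```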
🚀 Application
Advanced · 2:30 remaining
Avoid shuffle in join by partitioning
You have two large DataFrames, df1 and df2, both partitioned by 'user_id'. You want to join them on 'user_id' without causing a shuffle. Which approach is correct?
💡 Hint
Matching partitioning keys avoids shuffle in join.
Explanation
Repartitioning both DataFrames by the join key ensures rows with the same key are co-located in matching partitions, so the join can run without an additional exchange.
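The idea can be sketched in plain Python — a model of co-partitioning, not Spark's planner. If both sides are partitioned by the same hash of user_id, matching keys always land in the same partition index, so the join can proceed partition-by-partition with no cross-partition traffic:

```python
NUM_PARTITIONS = 4

def partition_by_key(rows, n=NUM_PARTITIONS):
    """Hash-partition (key, value) rows, as repartition('user_id') would."""
    parts = [[] for _ in range(n)]
    for key, val in rows:
        parts[hash(key) % n].append((key, val))
    return parts

left = partition_by_key([(1, 'a'), (2, 'b'), (3, 'c')])
right = partition_by_key([(1, 'x'), (2, 'y'), (4, 'z')])

# Co-partitioned join: pair up rows only within the same partition index.
joined = []
for lpart, rpart in zip(left, right):
    lookup = dict(rpart)
    joined.extend((k, v, lookup[k]) for k, v in lpart if k in lookup)

print(sorted(joined))  # → [(1, 'a', 'x'), (2, 'b', 'y')]
```

Because both sides used the same partitioner, no partition ever needs to consult another partition's data, which is exactly the property that lets Spark skip the exchange.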
❓ Visualization
Expert · 3:00 remaining
Visualize shuffle stages in Spark UI
You run a Spark job with multiple transformations including groupBy and join. Which visualization in Spark UI helps you identify shuffle stages and their cost?
💡 Hint
Look for visual representation of stages and data movement.
Explanation
The DAG visualization shows stages and the shuffle dependencies between them; stage boundaries mark shuffles, which helps you identify costly shuffle operations.