
Avoiding shuffle operations in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of join without shuffle
Consider two Spark DataFrames df1 and df2, both partitioned by the same column 'id'. What will be the output count of the following join operation?
PySpark:
df1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'val1']).repartition('id')
df2 = spark.createDataFrame([(1, 'x'), (2, 'y'), (4, 'z')], ['id', 'val2']).repartition('id')
joined = df1.join(df2, 'id')
result_count = joined.count()
print(result_count)
A. 2
B. 3
C. 1
D. 4
💡 Hint
Think about which ids are common to both DataFrames.
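For reference, here is a plain-Python sketch (no Spark needed; the dict-based layout is purely illustrative) of why an inner join keeps only the rows whose id appears on both sides:

```python
# Illustrative sketch: an inner join keeps only rows whose key
# appears in BOTH inputs. Same ids and values as the quiz code.
left = {1: 'a', 2: 'b', 3: 'c'}    # df1: id -> val1
right = {1: 'x', 2: 'y', 4: 'z'}   # df2: id -> val2

common_ids = left.keys() & right.keys()   # ids present on both sides
joined = [(i, left[i], right[i]) for i in sorted(common_ids)]
print(len(joined))  # -> 2, the count the inner join on 'id' reports
```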
🧠 Conceptual (intermediate)
Why avoid shuffle in Spark?
Why is it important to avoid shuffle operations in Apache Spark when possible?
A. Shuffle operations increase the number of partitions automatically, which is always bad.
B. Shuffle operations are expensive because they involve disk and network I/O, slowing down the job.
C. Shuffle operations cause Spark to lose data, leading to incorrect results.
D. Shuffle operations prevent Spark from caching data in memory.
💡 Hint
Think about what happens behind the scenes during a shuffle.
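As a rough illustration of what happens behind the scenes, here is a plain-Python sketch (the data and partition count are invented) of a shuffle's exchange step: every record is routed to the partition that owns its key, which on a real cluster means network transfer and, when buffers fill, disk spills.

```python
# Illustrative sketch of a shuffle's exchange step: each record is
# re-bucketed by key so all values for a key land in one partition.
# On a cluster this routing crosses the network and may spill to disk,
# which is what makes shuffles expensive.
records = [('a', 1), ('b', 2), ('a', 3), ('c', 4)]
num_partitions = 2

partitions = {p: [] for p in range(num_partitions)}
for key, value in records:
    target = hash(key) % num_partitions  # the routing decision
    partitions[target].append((key, value))

# After the exchange, every key lives in exactly one partition.
for key in {k for k, _ in records}:
    homes = [p for p, rows in partitions.items()
             if any(k == key for k, _ in rows)]
    assert len(homes) == 1
```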
🔧 Debug (advanced)
Identify the shuffle-causing operation
Given the following Spark code, which line is the first to trigger a shuffle?
PySpark:
df = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'val'])
df2 = df.filter(df.id > 1)
df3 = df2.groupBy('val').count()
df4 = df3.orderBy('count')
df4.show()
A. df3 = df2.groupBy('val').count()
B. df2 = df.filter(df.id > 1)
C. df4 = df3.orderBy('count')
D. df4.show()
💡 Hint
groupBy operations almost always cause a shuffle; filter does not.
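To see why the hint singles out groupBy, here is a plain-Python sketch (illustrative only, not the Spark API) contrasting a narrow transformation like filter, which runs on each partition independently, with an aggregation whose keys may span partitions:

```python
from collections import Counter

# Two partitions of (id, val) rows, as in the quiz code above.
partitions = [[(1, 'a'), (2, 'b')], [(3, 'c')]]

# filter(id > 1) is narrow: each partition is processed on its own,
# with no data movement between partitions.
filtered = [[row for row in part if row[0] > 1] for part in partitions]

# groupBy('val').count() is wide: values for one key may sit in
# different partitions, so Spark must first co-locate them.
# That co-location step is the shuffle.
counts = Counter(val for part in filtered for _, val in part)
print(dict(counts))  # -> {'b': 1, 'c': 1}
```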
🚀 Application (advanced)
Avoid shuffle in join by partitioning
You have two large DataFrames df1 and df2 both partitioned by 'user_id'. You want to join them on 'user_id' without causing shuffle. Which approach is correct?
A. Sort both DataFrames by 'user_id' before join.
B. Use df1.join(df2, 'user_id') directly without repartitioning.
C. Collect df2 to driver and broadcast join with df1.
D. Repartition both DataFrames by 'user_id' before join.
💡 Hint
Matching partitioning on the join key lets Spark join partitions locally, avoiding a shuffle.
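A plain-Python sketch of the idea behind the hint (the data and the modulo partitioner are stand-ins for Spark's hash partitioner): when both inputs are already partitioned by the same key with the same partitioner, partition i of one side only ever needs partition i of the other, so the join runs locally with no data movement.

```python
# Illustrative sketch of a co-partitioned join. The modulo function
# stands in for Spark's hash partitioner; ids and values are invented.
num_partitions = 2
partitioner = lambda uid: uid % num_partitions

df1 = [(1, 'a'), (2, 'b'), (3, 'c')]   # (user_id, val1)
df2 = [(1, 'x'), (2, 'y'), (4, 'z')]   # (user_id, val2)

# Both sides use the SAME partitioner, so matching keys are already
# co-located in the same partition index.
parts1 = {p: {k: v for k, v in df1 if partitioner(k) == p}
          for p in range(num_partitions)}
parts2 = {p: {k: v for k, v in df2 if partitioner(k) == p}
          for p in range(num_partitions)}

# Join partition-by-partition: no record crosses a partition boundary.
joined = [(k, parts1[p][k], parts2[p][k])
          for p in range(num_partitions)
          for k in parts1[p].keys() & parts2[p].keys()]
print(sorted(joined))  # -> [(1, 'a', 'x'), (2, 'b', 'y')]
```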
Visualization (expert)
Visualize shuffle stages in Spark UI
You run a Spark job with multiple transformations including groupBy and join. Which visualization in Spark UI helps you identify shuffle stages and their cost?
A. The Executors tab showing memory usage.
B. The SQL tab showing executed queries.
C. The DAG visualization showing stages and shuffle dependencies.
D. The Storage tab showing cached RDDs.
💡 Hint
Look for visual representation of stages and data movement.