
Avoiding shuffle operations in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of join without shuffle
Consider two Spark DataFrames df1 and df2, both partitioned by the same column 'id'. What will be the output count of the following join operation?
PySpark:
df1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'val1']).repartition('id')
df2 = spark.createDataFrame([(1, 'x'), (2, 'y'), (4, 'z')], ['id', 'val2']).repartition('id')
joined = df1.join(df2, 'id')
result_count = joined.count()
print(result_count)
A. 2
B. 3
C. 1
D. 4
💡 Hint
Think about which ids are common to both DataFrames.
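For reference, here is a plain-Python sketch (no Spark needed; the dict-based layout is purely illustrative) of why an inner join keeps only the rows whose id appears on both sides:

```python
# Illustrative sketch: an inner join keeps only rows whose key
# appears in BOTH inputs. Same ids and values as the quiz code.
left = {1: 'a', 2: 'b', 3: 'c'}    # df1: id -> val1
right = {1: 'x', 2: 'y', 4: 'z'}   # df2: id -> val2

common_ids = left.keys() & right.keys()   # ids present on both sides
joined = [(i, left[i], right[i]) for i in sorted(common_ids)]
print(len(joined))  # -> 2, the count the inner join on 'id' reports
```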
🧠 Conceptual (intermediate)
Why avoid shuffle in Spark?
Why is it important to avoid shuffle operations in Apache Spark when possible?
A. Shuffle operations increase the number of partitions automatically, which is always bad.
B. Shuffle operations are expensive because they involve disk and network I/O, slowing down the job.
C. Shuffle operations cause Spark to lose data, leading to incorrect results.
D. Shuffle operations prevent Spark from caching data in memory.
💡 Hint
Think about what happens behind the scenes during a shuffle.
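As a rough illustration of what happens behind the scenes, here is a plain-Python sketch (the data and partition count are invented) of a shuffle's exchange step: every record is routed to the partition that owns its key, which on a real cluster means network transfer and, when buffers fill, disk spills.

```python
# Illustrative sketch of a shuffle's exchange step: each record is
# re-bucketed by key so all values for a key land in one partition.
# On a cluster this routing crosses the network and may spill to disk,
# which is what makes shuffles expensive.
records = [('a', 1), ('b', 2), ('a', 3), ('c', 4)]
num_partitions = 2

partitions = {p: [] for p in range(num_partitions)}
for key, value in records:
    target = hash(key) % num_partitions  # the routing decision
    partitions[target].append((key, value))

# After the exchange, every key lives in exactly one partition.
for key in {k for k, _ in records}:
    homes = [p for p, rows in partitions.items()
             if any(k == key for k, _ in rows)]
    assert len(homes) == 1
```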
🔧 Debug (advanced)
Identify the shuffle-causing operation
Given the following Spark code, which line is the first to trigger a shuffle?
PySpark:
df = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'val'])
df2 = df.filter(df.id > 1)
df3 = df2.groupBy('val').count()
df4 = df3.orderBy('count')
df4.show()
A. df3 = df2.groupBy('val').count()
B. df2 = df.filter(df.id > 1)
C. df4 = df3.orderBy('count')
D. df4.show()
💡 Hint
groupBy operations almost always cause a shuffle; filter does not.
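To see why the hint singles out groupBy, here is a plain-Python sketch (illustrative only, not the Spark API) contrasting a narrow transformation like filter, which runs on each partition independently, with an aggregation whose keys may span partitions:

```python
from collections import Counter

# Two partitions of (id, val) rows, as in the quiz code above.
partitions = [[(1, 'a'), (2, 'b')], [(3, 'c')]]

# filter(id > 1) is narrow: each partition is processed on its own,
# with no data movement between partitions.
filtered = [[row for row in part if row[0] > 1] for part in partitions]

# groupBy('val').count() is wide: values for one key may sit in
# different partitions, so Spark must first co-locate them.
# That co-location step is the shuffle.
counts = Counter(val for part in filtered for _, val in part)
print(dict(counts))  # -> {'b': 1, 'c': 1}
```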
🚀 Application (advanced)
Avoid shuffle in join by partitioning
You have two large DataFrames df1 and df2 both partitioned by 'user_id'. You want to join them on 'user_id' without causing shuffle. Which approach is correct?
A. Sort both DataFrames by 'user_id' before join.
B. Use df1.join(df2, 'user_id') directly without repartitioning.
C. Collect df2 to driver and broadcast join with df1.
D. Repartition both DataFrames by 'user_id' before join.
💡 Hint
Matching partitioning on the join key lets Spark join partitions locally, avoiding a shuffle.
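A plain-Python sketch of the idea behind the hint (the data and the modulo partitioner are stand-ins for Spark's hash partitioner): when both inputs are already partitioned by the same key with the same partitioner, partition i of one side only ever needs partition i of the other, so the join runs locally with no data movement.

```python
# Illustrative sketch of a co-partitioned join. The modulo function
# stands in for Spark's hash partitioner; ids and values are invented.
num_partitions = 2
partitioner = lambda uid: uid % num_partitions

df1 = [(1, 'a'), (2, 'b'), (3, 'c')]   # (user_id, val1)
df2 = [(1, 'x'), (2, 'y'), (4, 'z')]   # (user_id, val2)

# Both sides use the SAME partitioner, so matching keys are already
# co-located in the same partition index.
parts1 = {p: {k: v for k, v in df1 if partitioner(k) == p}
          for p in range(num_partitions)}
parts2 = {p: {k: v for k, v in df2 if partitioner(k) == p}
          for p in range(num_partitions)}

# Join partition-by-partition: no record crosses a partition boundary.
joined = [(k, parts1[p][k], parts2[p][k])
          for p in range(num_partitions)
          for k in parts1[p].keys() & parts2[p].keys()]
print(sorted(joined))  # -> [(1, 'a', 'x'), (2, 'b', 'y')]
```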
Visualization (expert)
Visualize shuffle stages in Spark UI
You run a Spark job with multiple transformations including groupBy and join. Which visualization in Spark UI helps you identify shuffle stages and their cost?
A. The Executors tab showing memory usage.
B. The SQL tab showing executed queries.
C. The DAG visualization showing stages and shuffle dependencies.
D. The Storage tab showing cached RDDs.
💡 Hint
Look for visual representation of stages and data movement.