
Avoiding shuffle operations in Apache Spark - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
Task 1: Fill in the blank (easy)

Complete the code to cache the DataFrame and avoid shuffle during repeated actions.

Apache Spark
df.[1]()
A. cache
B. collect
C. show
D. count
Common Mistakes
collect() moves all rows to the driver; it does not cache anything or avoid shuffle.
show() only displays a sample of the data; it does not cache it.
count() triggers computation but does not cache the result.
Task 2: Fill in the blank (medium)

Complete the code to avoid shuffle by using a broadcast join.

Apache Spark
from pyspark.sql.functions import broadcast
joined_df = large_df.join([1](small_df), 'id')
A. collect
B. persist
C. broadcast
D. cache
Common Mistakes
cache() stores the DataFrame in memory but does not change the join strategy, so the shuffle still happens.
collect() pulls the data to the driver instead of joining it in a distributed way.
persist() behaves like cache() and likewise does not broadcast the table.
Task 3: Fill in the blank (hard)

Fix the error in the code to avoid shuffle by using partitioning correctly.

Apache Spark
df = df.repartition([1])
A. 'id'
B. 10
C. df
D. None
Common Mistakes
Passing only an integer (e.g. repartition(10)) triggers a full round-robin shuffle without co-locating rows by key.
Passing the DataFrame itself is invalid; repartition() expects a partition count and/or column expressions.
Passing None does not repartition the data.
Task 4: Fill in the blank (hard)

Fill both blanks to write the DataFrame with disk partitioning so that downstream reads can prune partitions instead of scanning and shuffling the full dataset.

Apache Spark
df.write.partitionBy([1]).mode([2]).save('path/to/data')
A. 'category'
B. 'overwrite'
C. 'append'
D. 'id'
Common Mistakes
Append mode adds new files on every run, which can duplicate data if the job is re-executed.
Partitioning by a high-cardinality column such as an ID creates one directory per value and groups data poorly.
Choosing the wrong save mode can duplicate data or fail the write.
Task 5: Fill in the blank (hard)

Fill all three blanks to complete a Python dictionary comprehension that filters and transforms the entries correctly.

Python
result = {[1]: [2] for k, [2] in [3] if [2] > 10}
A. k
B. v
C. data.items()
D. k.upper()
Common Mistakes
Using k instead of k.upper() leaves the keys untransformed.
Iterating over data instead of data.items() yields only keys, so unpacking k, v fails.
Filtering on the key instead of the value produces wrong results.
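A plain-Python sketch of the completed comprehension; 'data' is an assumed sample dictionary (this task exercises Python itself, not Spark):

```python
# Assumed sample input for illustration.
data = {"a": 5, "b": 20, "c": 42}

# Upper-case each key, keep only entries whose value exceeds 10.
result = {k.upper(): v for k, v in data.items() if v > 10}
```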