Complete the code to cache the DataFrame and avoid recomputation (including any shuffles in its lineage) during repeated actions.
df.[1]()
Using cache() stores the DataFrame in memory after the first action, so repeated actions reuse the cached partitions instead of re-running the full lineage, shuffles included.
Complete the code to avoid shuffle by using a broadcast join.
from pyspark.sql.functions import broadcast
joined_df = large_df.join([1](small_df), 'id')
Broadcasting the smaller DataFrame avoids a shuffle by sending a full copy of it to every executor, so the larger DataFrame is joined in place (a broadcast hash join) rather than being repartitioned by key.
Fix the error in the code so that repartitioning by a column is used correctly.
df = df.repartition([1])
Repartitioning by a column name performs one shuffle that colocates rows with the same key, so subsequent joins or aggregations on that column can reuse the layout and avoid further shuffles.
Fill both blanks to create a DataFrame with partitioning and avoid shuffle on writes.
df.write.partitionBy([1]).mode([2]).save('path/to/data')
Partitioning the output by a column writes one directory per distinct value, and overwrite mode replaces any existing data at the path; this enables partition pruning on later reads, though the write itself does not eliminate shuffles in the job that produced the data.
Fill all three blanks to create a dictionary comprehension that filters and maps the data correctly.
result = {[1]: [2] for k, [2] in [3] if [2] > 10}
The comprehension uses k.upper() as the key and v as the value, iterates over data.items(), and keeps only entries whose value exceeds 10, filtering the data down before any further processing.
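The completed comprehension, shown on a small sample dictionary (the contents of `data` are an assumed example):

```python
# Sample input; keys are lower-case, values are numbers.
data = {"a": 5, "b": 12, "c": 30}

# Key is upper-cased, value kept as-is, entries with v <= 10 are dropped.
result = {k.upper(): v for k, v in data.items() if v > 10}
```

Filtering inside the comprehension means the rejected entries are never built into the result at all, rather than being removed afterwards.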