Practice - 5 Tasks
Answer the questions below
1. Fill in the blank (easy)
Complete the code to broadcast the small DataFrame before joining.
Apache Spark
from pyspark.sql.functions import broadcast
result = large_df.join([1](small_df), 'id')
Common Mistakes
Using cache() or persist() instead of broadcast()
Trying to collect() the DataFrame before join
Explanation: The broadcast() function tells Spark to send the small DataFrame to all worker nodes for efficient joining.
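Conceptually, a broadcast hash join copies the small table to every worker and builds an in-memory hash map from it, so the large table can be streamed through without a shuffle. A minimal pure-Python sketch of that idea (hypothetical row data; this models the mechanism, it is not the Spark API):

```python
# Conceptual sketch of a broadcast hash join (not the Spark API).
# The small side becomes a hash map keyed on the join column;
# the large side is streamed row by row against that map.

def broadcast_hash_join(large_rows, small_rows, key):
    # "Broadcast" step: build a lookup table from the small side once.
    lookup = {row[key]: row for row in small_rows}
    joined = []
    for row in large_rows:            # stream the large side, no shuffle needed
        match = lookup.get(row[key])
        if match is not None:         # inner-join semantics
            joined.append({**row, **match})
    return joined

# Hypothetical sample data
large_df = [{"id": 1, "large_col": "a"}, {"id": 2, "large_col": "b"},
            {"id": 3, "large_col": "c"}]
small_df = [{"id": 1, "small_col": "x"}, {"id": 3, "small_col": "y"}]

result = broadcast_hash_join(large_df, small_df, "id")
```

This is why broadcasting only pays off when one side is small: the whole lookup table must fit in each worker's memory.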
2. Fill in the blank (medium)
Complete the code to perform a broadcast join with a condition on 'id'.
Apache Spark
joined_df = large_df.join([1](small_df), large_df.id == small_df.id)
Common Mistakes
Using cache() or persist() instead of broadcast()
Not wrapping the small DataFrame at all
Explanation: broadcast() wraps the small DataFrame to optimize the join by sending it to all nodes.
3. Fill in the blank (hard)
Fix the error in the code to correctly broadcast the small DataFrame before the join.
Apache Spark
from pyspark.sql.functions import broadcast
joined = large_df.join([1](small_df), 'id')
Common Mistakes
Calling broadcast() after join() instead of before
Using cache() or persist() instead
Explanation: broadcast() must wrap the small DataFrame before the join, not be called after join().
4. Fill in the blank (hard)
Fill both blanks to create a broadcast join and select columns from both DataFrames.
Apache Spark
from pyspark.sql.functions import [1]
result = large_df.join([2](small_df), 'id').select('large_col', 'small_col')
Common Mistakes
Importing broadcast but not using it
Using different functions for import and usage
Explanation: broadcast is imported and used to wrap the small DataFrame for an efficient join.
5. Fill in the blank (hard)
Fill all three blanks to broadcast the small DataFrame, join on 'id', and filter the results.
Apache Spark
from pyspark.sql.functions import [1]
joined = large_df.join([2](small_df), 'id')
filtered = joined.filter(joined.[3] > 100)
Common Mistakes
Using cache or persist instead of broadcast
Filtering on a non-existent column
Explanation: broadcast is imported and used to wrap small_df; 'value' is the column filtered on.
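The full answer pattern in task 5 (broadcast the small side, join on 'id', then filter) can be mimicked end to end in plain Python. A conceptual sketch with made-up data and an illustrative 'value' column, not the real PySpark API:

```python
# Plain-Python model of the join-then-filter pipeline from task 5.
# Column names ("id", "value") and all data are illustrative.

def hash_join(large_rows, small_rows, key):
    # Stand-in for broadcast(small_df): hash the small side once,
    # then probe it while streaming the large side.
    lookup = {row[key]: row for row in small_rows}
    return [{**row, **lookup[row[key]]}
            for row in large_rows if row[key] in lookup]

large_df = [{"id": 1}, {"id": 2}, {"id": 3}]
small_df = [{"id": 1, "value": 50}, {"id": 2, "value": 150},
            {"id": 3, "value": 300}]

joined = hash_join(large_df, small_df, "id")
# Equivalent of: filtered = joined.filter(joined.value > 100)
filtered = [row for row in joined if row["value"] > 100]
```

In real PySpark the filter runs distributed on each partition of the joined DataFrame; the list comprehension here just models the same predicate.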