Challenge - 5 Problems
Broadcast Join Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Difficulty: intermediate · Time limit: 2:00
Output of broadcast join with small table
What is the output count of the following Spark code snippet that uses a broadcast join?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Large DataFrame
df_large = spark.createDataFrame([(1, 'A'), (2, 'B'), (3, 'C')], ['id', 'value'])

# Small DataFrame to broadcast
small_df = spark.createDataFrame([(1, 'X'), (3, 'Y')], ['id', 'desc'])

# Broadcast join
joined_df = df_large.join(broadcast(small_df), 'id', 'inner')
print(joined_df.count())
💡 Hint
Think about how many matching keys exist in both DataFrames.
Explanation: The join is an inner join on 'id'. The small DataFrame has ids 1 and 3, which match two rows in the large DataFrame, so the result has 2 rows.
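The count can be reasoned about without a cluster. As a minimal pure-Python sketch of the same inner-join logic (a simulation of the idea, not Spark code): only rows whose key also appears in the small table survive.

```python
# Pure-Python sketch of the inner join above (not Spark itself):
# only large-table rows whose key appears in the small table are kept.
large = [(1, 'A'), (2, 'B'), (3, 'C')]
small = [(1, 'X'), (3, 'Y')]

small_keys = {k for k, _ in small}          # the keys that get "broadcast"
joined = [(k, v) for k, v in large if k in small_keys]

print(len(joined))  # 2 -- ids 1 and 3 match
```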
❓ Data Output
Difficulty: intermediate · Time limit: 2:00
Resulting DataFrame after broadcast join
What is the content of the DataFrame after this broadcast join?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(10, 'apple'), (20, 'banana'), (30, 'cherry')], ['key', 'fruit'])
small_df = spark.createDataFrame([(10, 'red'), (30, 'dark red')], ['key', 'color'])

result = df1.join(broadcast(small_df), 'key', 'left')
result.show()
💡 Hint
Remember that a left join keeps all rows from the left DataFrame and adds matching rows from the right.
Explanation: The left join keeps all keys from df1. Keys 10 and 30 pick up matching colors; key 20 has no match, so its color is null.
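The resulting rows can be sketched in plain Python, with None standing in for Spark's null (a simulation of the left-join semantics, not Spark code):

```python
# Pure-Python sketch of the left join above: every left row is kept,
# and color is None when the key has no match in the small table.
df1 = [(10, 'apple'), (20, 'banana'), (30, 'cherry')]
colors = {10: 'red', 30: 'dark red'}        # small side as a lookup map

result = [(k, fruit, colors.get(k)) for k, fruit in df1]
print(result)
# [(10, 'apple', 'red'), (20, 'banana', None), (30, 'cherry', 'dark red')]
```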
🔧 Debug
Difficulty: advanced · Time limit: 2:00
Identify the error in broadcast join code
What error will this Spark code raise when executed?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large_df = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'val'])
small_df = spark.createDataFrame([(1, 'X'), (2, 'Y')], ['key', 'desc'])

joined = large_df.join(broadcast(small_df), large_df.id == small_df.key)
joined.show()
💡 Hint
Check if the join columns have the same name or if they are ambiguous.
Explanation: Despite the question's framing, this code raises no error. Because the join keys have different names (id and key), the result simply contains both columns side by side; there are no duplicate or ambiguous column names. Ambiguity problems arise instead when both sides share a column name and the join condition is an expression rather than a column-name list, since both same-named columns are then retained.
🚀 Application
Difficulty: advanced · Time limit: 2:00
Choosing broadcast join for performance
You have a large DataFrame with 10 million rows and a small DataFrame with 100 rows. Which Spark join strategy is best for performance?
💡 Hint
Think about minimizing data movement for small tables.
Explanation: A broadcast join is best: it ships the 100-row DataFrame to every executor, so the 10-million-row DataFrame can be joined in place without an expensive shuffle.
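The mechanics behind this can be sketched in plain Python: a broadcast hash join turns the small table into a hash map, copies it to every worker, and probes it locally, so the large table never moves. This is a simulation of the idea, not Spark code:

```python
# Simulated broadcast hash join: the small table becomes a hash map that
# is (conceptually) copied to every executor; each partition of the large
# table is then joined locally, with no shuffle of the large side.
small = [(1, 'X'), (3, 'Y')]
broadcast_map = dict(small)                 # the "broadcast" copy

large_partitions = [[(1, 'A'), (2, 'B')], [(3, 'C')]]  # large side stays put

joined = []
for partition in large_partitions:          # runs independently per executor
    for key, value in partition:
        if key in broadcast_map:
            joined.append((key, value, broadcast_map[key]))

print(joined)  # [(1, 'A', 'X'), (3, 'C', 'Y')]
```

The large side is only ever read where it already lives, which is exactly the data movement a shuffle join cannot avoid.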
🧠 Conceptual
Difficulty: expert · Time limit: 2:00
Effect of broadcast join on shuffle partitions
When using a broadcast join in Spark, what happens to the shuffle partitions compared to a regular shuffle join?
💡 Hint
Consider how broadcast join avoids shuffling the small table.
Explanation: With a broadcast join, the small table is copied to every executor, so neither side needs to be shuffled for the join. Compared with a shuffle (sort-merge) join, this eliminates the shuffle stage entirely, along with its partition-exchange overhead.
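Spark also chooses this strategy automatically for tables smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default). As a hedged configuration sketch, assuming an active SparkSession named `spark`:

```python
# Configuration sketch (assumes an active SparkSession named `spark`):
# raise the size limit under which Spark auto-broadcasts a join side.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)  # 50 MB

# Setting it to -1 disables automatic broadcasting entirely; wrapping a
# DataFrame in broadcast(df) inside a join is then the explicit way to
# request the broadcast strategy.
```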