
Broadcast joins for small tables in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of broadcast join with small table
What is the output count of the following Spark code snippet that uses a broadcast join?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Large DataFrame
df_large = spark.createDataFrame([(1, 'A'), (2, 'B'), (3, 'C')], ['id', 'value'])

# Small DataFrame to broadcast
small_df = spark.createDataFrame([(1, 'X'), (3, 'Y')], ['id', 'desc'])

# Broadcast join
joined_df = df_large.join(broadcast(small_df), 'id', 'inner')

print(joined_df.count())
A. 3
B. 1
C. 0
D. 2
💡 Hint
Think about how many matching keys exist in both DataFrames.
Data Output (intermediate)
Resulting DataFrame after broadcast join
What is the content of the DataFrame after this broadcast join?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(10, 'apple'), (20, 'banana'), (30, 'cherry')], ['key', 'fruit'])
small_df = spark.createDataFrame([(10, 'red'), (30, 'dark red')], ['key', 'color'])

result = df1.join(broadcast(small_df), 'key', 'left')
result.show()
A. [Row(key=10, fruit='apple', color='red'), Row(key=20, fruit='banana', color='yellow'), Row(key=30, fruit='cherry', color='dark red')]
B. [Row(key=10, fruit='apple', color=None), Row(key=20, fruit='banana', color=None), Row(key=30, fruit='cherry', color=None)]
C. [Row(key=10, fruit='apple', color='red'), Row(key=20, fruit='banana', color=None), Row(key=30, fruit='cherry', color='dark red')]
D. [Row(key=10, fruit='apple', color='red'), Row(key=20, fruit='banana', color='red'), Row(key=30, fruit='cherry', color='dark red')]
💡 Hint
Remember that a left join keeps all rows from the left DataFrame and adds matching rows from the right.
🔧 Debug (advanced)
Identify the error in broadcast join code
What happens when this Spark code is executed?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large_df = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'val'])
small_df = spark.createDataFrame([(1, 'X'), (2, 'Y')], ['key', 'desc'])

joined = large_df.join(broadcast(small_df), large_df.id == small_df.key)
joined.show()
A. No error; outputs a joined DataFrame with 2 rows
B. AnalysisException: "Reference 'key' is ambiguous, could be: key, key."
C. TypeError: join condition must be a Column
D. ValueError: Cannot broadcast a DataFrame with more than 100MB
💡 Hint
Check if the join columns have the same name or if they are ambiguous.
🚀 Application (advanced)
Choosing broadcast join for performance
You have a large DataFrame with 10 million rows and a small DataFrame with 100 rows. Which Spark join strategy is best for performance?
A. Use a cartesian join to combine all rows from both DataFrames
B. Use a broadcast join to send the small DataFrame to all worker nodes
C. Use a shuffle join to redistribute both DataFrames across the cluster
D. Use a cross join with a filter to reduce data size
💡 Hint
Think about minimizing data movement for small tables.
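To build intuition for why the small-table strategy wins, here is a conceptual sketch in plain Python (not the Spark API): a broadcast hash join builds a hash map from the small table once, ships a copy to every worker, and lets each partition of the large table probe it locally, so the large table never moves. The table contents and partitioning below are made up for illustration.

```python
# Conceptual sketch (plain Python, not Spark internals) of a broadcast hash join.
small_table = [(1, 'X'), (3, 'Y')]        # (id, desc) -- the small side
large_partitions = [                      # the large side, already split
    [(1, 'A'), (2, 'B')],                 # across workers
    [(3, 'C')],
]

# Step 1: build a hash map from the small table (done once, then broadcast
# to every worker).
broadcast_map = {key: desc for key, desc in small_table}

# Step 2: each partition joins locally against its copy of the map;
# inner-join semantics, so unmatched keys are dropped.
def join_partition(partition, lookup):
    return [(key, val, lookup[key]) for key, val in partition if key in lookup]

joined = []
for part in large_partitions:
    joined.extend(join_partition(part, broadcast_map))

print(joined)  # only ids 1 and 3 appear in both tables
```

The point of the sketch is the data-movement asymmetry: only the tiny hash map is replicated, while the 10-million-row side is processed in place.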
🧠 Conceptual (expert)
Effect of broadcast join on shuffle partitions
When using a broadcast join in Spark, what happens to the shuffle partitions compared to a regular shuffle join?
A. Shuffle partitions are eliminated for the broadcasted table, reducing shuffle overhead
B. Shuffle partitions increase because data is duplicated across nodes
C. Shuffle partitions remain the same as in a regular shuffle join
D. Shuffle partitions are replaced by a cartesian product
💡 Hint
Consider how broadcast join avoids shuffling the small table.
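A rough back-of-the-envelope comparison makes the shuffle difference concrete. In a shuffle (sort-merge or shuffled hash) join, both tables are repartitioned by key, so roughly every row of both sides can cross the network; in a broadcast join, only the small table is copied to each worker. The worker count and row counts below are illustrative assumptions, not measured values.

```python
# Conceptual sketch (plain Python, not Spark internals): approximate rows
# moved over the network under each join strategy.
num_workers = 4          # assumed cluster size
large_rows = 10_000_000  # large DataFrame (as in the Application question)
small_rows = 100         # small DataFrame

# Shuffle join: both sides are hash-partitioned by join key, so in the worst
# case every row of both tables is moved to the partition that owns its key.
shuffle_join_moved = large_rows + small_rows

# Broadcast join: one full copy of the small table per worker; the large
# table stays where it is and is never shuffled.
broadcast_join_moved = small_rows * num_workers

print(shuffle_join_moved)    # 10000100
print(broadcast_join_moved)  # 400
```

Under these assumptions the broadcast strategy moves orders of magnitude less data, which is why Spark eliminates the shuffle stage for the broadcasted side.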