
Broadcast joins for small tables in Apache Spark - Step-by-Step Execution

Concept Flow - Broadcast joins for small tables
Start with two tables
Identify small table
Broadcast small table to all nodes
Perform join locally on each node
Combine results from all nodes
Output joined table
A broadcast join sends a copy of the small table to every worker node. Each node then joins that copy with its own portion of the big table, so the big table never has to be shuffled across the network.
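The flow above can be imitated in plain Python to make the "join locally on each node" step concrete. This is only a toy model, not how Spark is implemented: the function name `broadcast_hash_join`, the choice of 4 workers, and the way the big table is split into partitions are all assumptions made for illustration.

```python
def broadcast_hash_join(big_partitions, small_rows):
    """Toy model of a broadcast join (illustrative only, not Spark internals)."""
    # "Broadcast": build one lookup table from the small side.
    # In a real cluster every worker would receive its own copy of it.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    joined = []
    for partition in big_partitions:
        # Local join: each partition only probes the lookup table and
        # never needs rows that live in another partition.
        for key in partition:
            for value in lookup.get(key, ()):
                joined.append((key, value))
    # "Combine results from all nodes": here that is just one list.
    return joined


num_workers = 4  # hypothetical cluster size
# Split keys 0..999,999 round-robin across the workers.
big_partitions = [range(i, 1_000_000, num_workers) for i in range(num_workers)]
small_rows = [(1, "A"), (2, "B")]

result = sorted(broadcast_hash_join(big_partitions, small_rows))
print(result)  # [(1, 'A'), (2, 'B')]
```

The point of the sketch is that the inner loop touches only one partition plus the shared lookup table, which mirrors why no shuffle of the big table is needed.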
Execution Sample
Apache Spark
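from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Reuse the active SparkSession or start a new one.
spark = SparkSession.builder.getOrCreate()

# Big table: 1,000,000 rows with keys 0 to 999,999.
big_df = spark.range(1000000).withColumnRenamed("id", "key")
# Small table: 2 rows, small enough to copy to every executor.
small_df = spark.createDataFrame([(1, "A"), (2, "B")], ["key", "value"])

# broadcast() hints that small_df should be sent to every executor for the join.
joined_df = big_df.join(broadcast(small_df), "key")
# Show up to 5 rows; only keys 1 and 2 match, so 2 rows are printed.
joined_df.show(5)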
This code joins a 1,000,000-row table with a 2-row table. Wrapping the small table in broadcast() hints that Spark should copy it to every executor instead of shuffling the big table.
Execution Table
Step | Action | Details | Result
1 | Create big_df | Generate 1,000,000 rows with a 'key' column | big_df with keys 0 to 999,999
2 | Create small_df | Create a small table with 2 rows (keys 1 and 2) | small_df with keys 1 and 2
3 | Broadcast small_df | Send a copy of small_df to all worker nodes | small_df available locally on each node
4 | Join big_df and small_df | Join on the 'key' column locally on each node | Rows with keys 1 and 2 matched with values A and B
5 | Collect results | Combine joined rows from all nodes | Joined DataFrame with the 2 matching rows
6 | Show output | show(5) displays up to 5 rows | The 2 rows with keys 1 and 2 and their values are shown
7 | Exit | Join complete | Process ends
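💡 The join completes once all matching rows are combined and displayed. Because Spark evaluates lazily, the broadcast and join in steps 3 to 5 actually run only when show() is called.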
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 | After Step 5 | Final
big_df | None | 1,000,000 rows with keys 0-999,999 | 1,000,000 rows | 1,000,000 rows | 1,000,000 rows | 1,000,000 rows | 1,000,000 rows
small_df | None | None | 2 rows with keys 1 and 2 | Broadcasted to all nodes | Broadcasted | Broadcasted | Broadcasted
joined_df | None | None | None | None | Rows with keys 1 and 2 joined | Joined rows collected | Joined DataFrame with matched rows
Key Moments - 3 Insights
Why do we broadcast the small table instead of the big table?
The small table fits in memory on every node, so copying it everywhere is cheap. Copying the big table to every node would move far more data and could exhaust memory. See step 3 of the Execution Table, where only small_df is broadcasted.
What happens if the small table is not broadcasted?
If the small table is not broadcast, either through an explicit hint or automatically by Spark, the join typically falls back to a shuffle-based strategy such as a sort-merge join, which repartitions both tables by key across the network and is slower. Broadcasting avoids this by sending the small table to all nodes up front (compare steps 3 and 4 of the Execution Table).
How does broadcasting improve join speed?
Once every node holds a copy of the small table, each node can join its own partitions of the big table locally without waiting for rows from other nodes. This reduces network traffic and speeds up the join (steps 3 and 4 of the Execution Table).
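The trade-off described in these answers can be put into rough numbers. The sketch below is a deliberately simplified model that counts rows rather than bytes and ignores details such as collecting the small table on the driver first. The two function names and the 4-worker cluster are assumptions for illustration only.

```python
def rows_moved_shuffle_join(big_rows, small_rows):
    # Worst case for a shuffle-based join: both tables are repartitioned
    # by key, so every row of both sides may cross the network once.
    return big_rows + small_rows


def rows_moved_broadcast_join(small_rows, num_workers):
    # Broadcast join: one copy of the small table per worker;
    # the big table stays where it is.
    return small_rows * num_workers


num_workers = 4  # hypothetical cluster size

# The tutorial's case: 1,000,000 big rows, 2 small rows.
shuffle_cost = rows_moved_shuffle_join(1_000_000, 2)
broadcast_cost = rows_moved_broadcast_join(2, num_workers)
print(shuffle_cost, broadcast_cost)  # 1000002 8

# If the "small" table also had 1,000,000 rows, broadcasting would move more data.
shuffle_cost_large = rows_moved_shuffle_join(1_000_000, 1_000_000)
broadcast_cost_large = rows_moved_broadcast_join(1_000_000, num_workers)
print(shuffle_cost_large, broadcast_cost_large)  # 2000000 4000000
```

Under this model the broadcast cost grows with the number of workers, which is one reason broadcasting only pays off when one side is genuinely small.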
Visual Quiz - 3 Questions
Test your understanding
According to the Execution Table, at which step is the small table sent to all worker nodes?
A. Step 3
B. Step 4
C. Step 2
D. Step 5
💡 Hint
Check the 'Action' column of the Execution Table for the broadcast step.
According to the Variable Tracker, what is the state of joined_df after step 4?
A. None
B. Rows with keys 1 and 2 joined
C. 1,000,000 rows
D. Broadcasted small_df
💡 Hint
Look at the joined_df row under 'After Step 4' in the Variable Tracker.
If the small table had 1 million rows, what would likely happen to the broadcast step?
A. Big table would be broadcasted instead
B. Broadcast would still be efficient
C. Broadcast would fail or be slow due to size
D. Join would happen without broadcast
💡 Hint
Recall from Key Moments and step 3 of the Execution Table why only small tables are broadcasted.
Concept Snapshot
Broadcast joins send a small table to all worker nodes.
This avoids shuffling the big table across the network.
Use broadcast() on the small DataFrame in Spark.
Join happens locally on each node.
Speeds up joins when one table is much smaller.
Full Transcript
Broadcast joins in Apache Spark work by sending the small table to all worker nodes. This lets each node join the small table locally with the big table, avoiding slow network shuffles. The process starts by creating the big and small tables. Then the small table is broadcasted to all nodes. Each node performs the join locally. Finally, results are combined and shown. Broadcasting is efficient only if the small table fits in memory. This method speeds up joins when one table is much smaller than the other.