Broadcast joins for small tables in Apache Spark - Step-by-Step Execution

Concept Flow - Broadcast joins for small tables

Start with two tables

↓

Identify small table

↓

Broadcast small table to all nodes

↓

Perform join locally on each node

↓

Combine results from all nodes

↓

Output joined table

A broadcast join sends a full copy of the small table to every worker node. Each node then joins its own part of the big table against that local copy, so the big table never has to be shuffled across the network.
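The flow above can be sketched in plain Python, without Spark. This is only an illustration of the idea, not how Spark implements it: the function name `broadcast_hash_join`, the three hand-made partitions, and the assumption that keys in the small table are unique are all hypothetical choices for the example.

```python
def broadcast_hash_join(big_partitions, small_rows):
    """Inner-join every partition of the big table against a copy of the small table."""
    # "Broadcast": build one key -> value lookup from the small table.
    # On a real cluster each worker would hold its own copy of this lookup.
    # Assumption: keys in the small table are unique.
    lookup = {key: value for key, value in small_rows}

    joined_partitions = []
    for partition in big_partitions:
        # Local join: a partition only consults the lookup, so no big-table rows move.
        joined_partitions.append(
            [(key, lookup[key]) for key in partition if key in lookup]
        )

    # Combine the per-partition results into a single output.
    return [row for part in joined_partitions for row in part]


# Hypothetical data: keys 0..9 split across three "workers".
big_partitions = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
small_rows = [(1, "A"), (2, "B")]

result = broadcast_hash_join(big_partitions, small_rows)
print(result)  # [(1, 'A'), (2, 'B')]
```

Only the first partition contains keys 1 and 2, so it is the only one that contributes rows; the other two partitions finish their local join with nothing to emit.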

Execution Sample

Apache Spark

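from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Big table: 1,000,000 rows with a single "key" column (0 to 999,999)
big_df = spark.range(1000000).withColumnRenamed("id", "key")
# Small table: 2 rows, cheap to copy to every worker
small_df = spark.createDataFrame([(1, "A"), (2, "B")], ["key", "value"])

# broadcast() hints that small_df should be sent whole to each worker
joined_df = big_df.join(broadcast(small_df), "key")
joined_df.show(5)  # only keys 1 and 2 match, so 2 rows are displayed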

This code joins a big table with a small table, using the broadcast() hint so that Spark copies the small table to every worker instead of shuffling the big one.

Execution Table

Step	Action	Details	Result
1	Create big_df	Generate 1,000,000 rows with 'key' column	big_df with keys 0 to 999,999
2	Create small_df	Create small table with 2 rows (keys 1 and 2)	small_df with keys 1 and 2
3	Broadcast small_df	Send small_df to all worker nodes	small_df available locally on each node
4	Join big_df and small_df	Join on 'key' column locally on each node	Rows with keys 1 and 2 matched with values A and B
5	Collect results	Combine joined rows from all nodes	Joined DataFrame with matching rows
6	Show output	Display up to 5 rows	The 2 matching rows (keys 1 and 2) and their values shown
7	Exit	Join complete	Process ends

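💡 The join completes once the matching rows from all nodes are combined and displayed. Because Spark evaluates lazily, the work in steps 3 to 5 actually runs when show() is called.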

Variable Tracker

Variable	Start	After Step 1	After Step 2	After Step 3	After Step 4	After Step 5	Final
big_df	None	1,000,000 rows with keys 0-999,999	1,000,000 rows	1,000,000 rows	1,000,000 rows	1,000,000 rows	1,000,000 rows
small_df	None	None	2 rows with keys 1 and 2	Broadcast to all nodes	Broadcast	Broadcast	Broadcast
joined_df	None	None	None	None	Rows with keys 1 and 2 joined	Joined rows collected	Joined DataFrame with matched rows

Key Moments - 3 Insights

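Why do we broadcast the small table instead of the big table?
Copying the small table to every node is cheap, and it fits in memory. Copying the big table to every node would move far more data than the join saves.

What happens if the small table is not broadcast?
Spark uses a shuffle-based join instead, moving rows of both tables across the network so that matching keys end up on the same node.

How does broadcasting improve join speed?
Each node joins its part of the big table against its local copy of the small table, so the big table is never shuffled.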
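A rough row count makes these questions concrete. The sketch below is a deliberately simple model, not a measurement: it counts rows sent over the network and ignores row size, compression, and the step where the small table is first gathered before being sent out. The function names and the figure of 8 workers are assumptions for illustration; the table sizes come from the sample above.

```python
def rows_moved_by_broadcast(broadcast_rows, num_workers):
    # Every worker receives a full copy of the broadcast table.
    return broadcast_rows * num_workers


def rows_moved_by_shuffle(big_rows, small_rows):
    # Upper bound: a shuffle join may repartition every row of both tables by key.
    return big_rows + small_rows


big_rows = 1_000_000   # size of big_df in the sample
small_rows = 2         # size of small_df in the sample
num_workers = 8        # hypothetical cluster size

broadcast_small = rows_moved_by_broadcast(small_rows, num_workers)
broadcast_big = rows_moved_by_broadcast(big_rows, num_workers)
shuffle_both = rows_moved_by_shuffle(big_rows, small_rows)

print(broadcast_small)  # 16
print(broadcast_big)    # 8000000
print(shuffle_both)     # 1000002
```

Under this model, broadcasting the small table moves 16 rows, a shuffle moves up to about a million, and broadcasting the big table would be the worst option of the three.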

Visual Quiz - 3 Questions

Test your understanding

Look at the Execution Table. At which step is the small table sent to all worker nodes?

A. Step 3
B. Step 4
C. Step 2
D. Step 5

Concept Snapshot

Broadcast joins send a copy of the small table to all worker nodes.
This avoids shuffling the big table across the network.
Wrap the small DataFrame in broadcast() (from pyspark.sql.functions) to request it.
The join then happens locally on each node.
This speeds up joins when one table is much smaller than the other and fits in memory.

Full Transcript

Broadcast joins in Apache Spark work by sending a copy of the small table to all worker nodes. Each node can then join its part of the big table against that local copy, avoiding a slow network shuffle of the big table. The process starts by creating the big and small tables. Then the small table is broadcast to all nodes. Each node performs the join locally. Finally, the results are combined and shown. Broadcasting is efficient only if the small table fits in memory on every node. This method speeds up joins when one table is much smaller than the other.
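The "fits in memory" condition can be pictured as a size check. Spark has a setting, spark.sql.autoBroadcastJoinThreshold (10 MB by default, with -1 disabling it), below which it may choose a broadcast join on its own. The sketch below is a simplified stand-in for that decision, not Spark's planner: the function `pick_join_strategy` and its return strings are hypothetical, and the real choice also depends on size estimates, join type, and hints.

```python
# Default value of spark.sql.autoBroadcastJoinThreshold: 10 MB.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def pick_join_strategy(left_bytes, right_bytes,
                       threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return which side to broadcast, or 'shuffle' if neither side is small enough."""
    smaller = min(left_bytes, right_bytes)
    # A threshold of -1 can never be met, which mirrors "auto-broadcast disabled".
    if smaller <= threshold_bytes:
        side = "left" if left_bytes <= right_bytes else "right"
        return f"broadcast {side}"
    return "shuffle"


# Hypothetical table sizes in bytes.
print(pick_join_strategy(8_000_000_000, 2_000))       # broadcast right
print(pick_join_strategy(50 * 1024 * 1024, 8_000_000_000))  # shuffle
print(pick_join_strategy(8_000_000_000, 2_000, threshold_bytes=-1))  # shuffle
```

In the second call the smaller table is 50 MB, above the 10 MB threshold, so the model falls back to a shuffle even though one side is far smaller than the other.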