Broadcast joins speed up joining a large table with a small one by sending a full copy of the small table to every worker node, which avoids shuffling the large table across the network.
Broadcast joins for small tables in Apache Spark
Introduction
When you join one big dataset with one small dataset.
When the small table fits comfortably in memory and you want faster joins.
When you want to reduce network traffic during join operations.
When you want to improve the performance of Spark SQL joins on small lookup tables.
When you want to avoid expensive shuffle operations in distributed joins.
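The idea behind these points can be sketched in plain Python (this is not Spark code): the small table is turned into an in-memory hash map, every worker gets a copy, and each worker joins its own partition of the big table locally, so the big table never moves. The table data and names below are illustrative.

```python
# Minimal sketch of broadcast-hash-join mechanics (plain Python, not Spark).

small_table = [(1, 'red'), (2, 'yellow'), (3, 'red')]  # (id, color)
big_partitions = [                                      # big table, pre-split across workers
    [(1, 'apple'), (2, 'banana')],
    [(3, 'cherry'), (4, 'date')],
]

# "Broadcast": build the lookup once; every worker receives the same copy.
broadcast_map = dict(small_table)

def join_partition(partition):
    # Inner join: keep only rows whose key exists in the broadcast map.
    return [(key, fruit, broadcast_map[key])
            for key, fruit in partition
            if key in broadcast_map]

# Each partition is joined independently -- no shuffle of the big table.
result = [row for part in big_partitions for row in join_partition(part)]
print(result)  # [(1, 'apple', 'red'), (2, 'banana', 'yellow'), (3, 'cherry', 'red')]
```

Note that id 4 ('date') is dropped: an inner join keeps only keys present in the broadcast map.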
Syntax
from pyspark.sql.functions import broadcast

joined_df = big_df.join(broadcast(small_df), on='key')
Use broadcast() around the small DataFrame to tell Spark to send it to all nodes.
The on parameter specifies the join key column(s).
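The same hint is available in Spark SQL via the BROADCAST join hint. The fragment below is a sketch assuming an active SparkSession named spark, and that big_df and small_df exist with a shared key column; the view names are illustrative.

```python
# Assumes an active SparkSession `spark` and DataFrames big_df / small_df.
big_df.createOrReplaceTempView('big')
small_df.createOrReplaceTempView('small')

# The BROADCAST hint marks the aliased table `s` for broadcasting,
# equivalent to wrapping it in broadcast() in the DataFrame API.
joined_df = spark.sql("""
    SELECT /*+ BROADCAST(s) */ *
    FROM big b
    JOIN small s ON b.key = s.key
""")
```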
Examples
Join orders (big) with customers (small) on customer_id using a broadcast join.

from pyspark.sql.functions import broadcast

result = orders.join(broadcast(customers), on='customer_id')
Join sales (big) with regions (small) on region_id using a broadcast join.

from pyspark.sql.functions import broadcast

result = sales.join(broadcast(regions), on=['region_id'])
Sample Program
This example creates a big table of fruits and a small table of colors. It joins them on id using a broadcast join for faster performance.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName('BroadcastJoinExample').getOrCreate()

# Create big DataFrame
big_data = [(1, 'apple'), (2, 'banana'), (3, 'cherry'), (4, 'date')]
big_df = spark.createDataFrame(big_data, ['id', 'fruit'])

# Create small DataFrame
small_data = [(1, 'red'), (2, 'yellow'), (3, 'red')]
small_df = spark.createDataFrame(small_data, ['id', 'color'])

# Perform broadcast join
joined_df = big_df.join(broadcast(small_df), on='id', how='inner')

# Show result
joined_df.show()
Output (row order may vary):

+---+------+------+
| id| fruit| color|
+---+------+------+
|  1| apple|   red|
|  2|banana|yellow|
|  3|cherry|   red|
+---+------+------+

Row id 4 ('date') is absent because the inner join drops keys with no match in the small table.
Important Notes
Broadcast joins work best when the small table fits comfortably in memory.
If the small table is too large, broadcasting can cause memory errors.
You can check the Spark UI or the query plan to confirm that a broadcast join was actually used.
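Spark can also choose a broadcast join on its own: tables smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default) are broadcast automatically. The fragment below is a sketch assuming an active SparkSession named spark and existing DataFrames big_df and small_df.

```python
# Assumes an active SparkSession `spark` and DataFrames big_df / small_df.

# Raise the automatic-broadcast threshold to 50 MB (value is in bytes);
# set it to -1 to disable automatic broadcasting entirely.
spark.conf.set('spark.sql.autoBroadcastJoinThreshold', 50 * 1024 * 1024)

# Inspect the physical plan: a broadcast join appears as BroadcastHashJoin.
joined_df = big_df.join(small_df, on='id')
joined_df.explain()
```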
Summary
Broadcast joins send the small table to all worker nodes to speed up joins.
Use the broadcast() function in Spark to mark a small table for broadcasting.
This reduces data shuffle and improves join performance when one table is small.