
Broadcast joins for small tables in Apache Spark

Introduction

Broadcast joins speed up joining a large table with a small one by sending (broadcasting) a copy of the small table to every worker node. Because each worker already has the whole small table, the rows of the big table never need to be shuffled across the network.

Use a broadcast join when:

One dataset is large and the other is small.
The small table fits comfortably in executor memory.
You want to reduce network traffic by avoiding a shuffle-based join.
You want to speed up Spark SQL joins that involve a small lookup or dimension table.
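Under the hood, a broadcast join behaves like a map-side hash join: each worker builds an in-memory hash table from the small table and probes it while scanning its own partition of the big table. A plain-Python sketch of that idea (the tables here are made up for illustration):

```python
# Sketch of the hash-join idea behind a broadcast join (pure Python).
# The "small" side is loaded into a dict (the broadcast hash table);
# the "big" side is streamed and probed row by row -- no shuffle needed.

small = [(1, "red"), (2, "yellow"), (3, "red")]                   # (id, color)
big = [(1, "apple"), (2, "banana"), (3, "cherry"), (4, "date")]   # (id, fruit)

# Build phase: hash the small table by the join key.
lookup = {key: color for key, color in small}

# Probe phase: stream the big table, emitting matches (inner join semantics).
joined = [(key, fruit, lookup[key]) for key, fruit in big if key in lookup]

print(joined)  # [(1, 'apple', 'red'), (2, 'banana', 'yellow'), (3, 'cherry', 'red')]
```

In Spark the build phase happens once per executor after the small table is broadcast, which is why no repartitioning of the big table is required.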
Syntax
Apache Spark
from pyspark.sql.functions import broadcast

joined_df = big_df.join(broadcast(small_df), on='key')

Wrap the small DataFrame in broadcast() to hint Spark to send a copy of it to every executor.

The on parameter specifies the join key: either a single column name or a list of column names.

Examples
Join orders (big) with customers (small) using broadcast join on customer_id.
Apache Spark
from pyspark.sql.functions import broadcast

result = orders.join(broadcast(customers), on='customer_id')
Join sales with regions on region_id using broadcast join.
Apache Spark
from pyspark.sql.functions import broadcast

result = sales.join(broadcast(regions), on=['region_id'])
Sample Program

This example creates a big table of fruits and a small table of colors. It joins them on id using a broadcast join for faster performance.

Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName('BroadcastJoinExample').getOrCreate()

# Create big DataFrame
big_data = [(1, 'apple'), (2, 'banana'), (3, 'cherry'), (4, 'date')]
big_df = spark.createDataFrame(big_data, ['id', 'fruit'])

# Create small DataFrame
small_data = [(1, 'red'), (2, 'yellow'), (3, 'red')]
small_df = spark.createDataFrame(small_data, ['id', 'color'])

# Perform broadcast join
joined_df = big_df.join(broadcast(small_df), on='id', how='inner')

# Show result
joined_df.show()
Output (row order may vary):

+---+------+------+
| id| fruit| color|
+---+------+------+
|  1| apple|   red|
|  2|banana|yellow|
|  3|cherry|   red|
+---+------+------+
Important Notes

Broadcast joins work best when the small table fits comfortably in each executor's memory.

If the broadcast table is too large, broadcasting can cause out-of-memory errors on the executors or timeouts while collecting the table.

Spark also broadcasts automatically when it estimates a table to be smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default), even without an explicit broadcast() hint.

You can check the SQL tab of the Spark UI to confirm that a BroadcastHashJoin was used.

Summary

Broadcast joins send the small table to all worker nodes to speed up joins.

Use the broadcast() function in PySpark to mark a small DataFrame for broadcasting.

This reduces data shuffle and improves join performance when one table is small.