
Broadcast variables in Apache Spark

Introduction

Broadcast variables let you share read-only data with every worker node in a Spark cluster. Spark sends the data to each executor once and caches it there, instead of shipping a copy with every task.

Use a broadcast variable when:
You have a small lookup table that every task needs to read.
You want to avoid sending the same data to the workers again with each task.
You want to improve performance by reducing network traffic.
You need to share read-only data across multiple stages of a job.
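A back-of-the-envelope comparison shows why this matters. The sketch below uses a deliberately simplified model: without broadcasting, data captured by a task's function is assumed to travel once per task, and with broadcasting once per executor. The helper name and all numbers (a 5 MB table, 1,000 tasks, 10 executors) are illustrative assumptions, not measurements, and the model ignores compression and the way Spark actually splits and distributes broadcast blocks.

```python
def shipped_bytes(data_size_bytes, num_tasks, num_executors):
    """Estimate bytes sent over the network under a simplified model.

    Hypothetical helper for illustration; not part of the Spark API.
    """
    # Without broadcast: the data travels with every task.
    without_broadcast = data_size_bytes * num_tasks
    # With broadcast: each executor receives the data once and caches it.
    with_broadcast = data_size_bytes * num_executors
    return without_broadcast, with_broadcast


# Assumed example figures: 5 MB lookup table, 1,000 tasks, 10 executors.
table_size = 5 * 1024 * 1024
without_bc, with_bc = shipped_bytes(table_size, num_tasks=1000, num_executors=10)

print(f"Without broadcast: {without_bc / 1024**2:.0f} MB")
print(f"With broadcast:    {with_bc / 1024**2:.0f} MB")
```

Under these assumptions the broadcast version moves 50 MB instead of 5,000 MB. Real savings depend on the cluster and the job.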
Syntax
Apache Spark
broadcastVar = sc.broadcast(value)

sc is the SparkContext.

value is the data you want to share, such as a list or dictionary. Inside a task, read it back with broadcastVar.value.

Examples
This creates a broadcast variable from a dictionary for fast access on all workers.
Apache Spark
lookup = {'a': 1, 'b': 2, 'c': 3}
broadcastVar = sc.broadcast(lookup)
Broadcast a list of numbers to use in your Spark tasks.
Apache Spark
numbers = [10, 20, 30]
broadcastVar = sc.broadcast(numbers)
Sample Program

This program broadcasts a dictionary of fruit counts. It then maps over an RDD of fruit names, pairing each name with its count from the broadcast variable. A fruit that is not in the dictionary gets a count of 0.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('BroadcastExample').getOrCreate()
sc = spark.sparkContext

# Data to broadcast
lookup = {'apple': 1, 'banana': 2, 'cherry': 3}

# Create broadcast variable
broadcastVar = sc.broadcast(lookup)

# Sample RDD
fruits = sc.parallelize(['apple', 'banana', 'apple', 'cherry', 'banana', 'date'])

# Use broadcast variable in map
result = fruits.map(lambda fruit: (fruit, broadcastVar.value.get(fruit, 0))).collect()

print(result)

spark.stop()
Output
[('apple', 1), ('banana', 2), ('apple', 1), ('cherry', 3), ('banana', 2), ('date', 0)]
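The per-element logic in the map step can be checked without a cluster. In the sketch below a plain dictionary stands in for broadcastVar.value, and fruit_with_count is a hypothetical helper name introduced for illustration; it is not part of the program above or of the Spark API.

```python
def fruit_with_count(fruit, lookup):
    """Pair a fruit name with its count, defaulting to 0 when it is missing."""
    return (fruit, lookup.get(fruit, 0))


# Same data as the sample program; the dict plays the role of broadcastVar.value.
lookup = {'apple': 1, 'banana': 2, 'cherry': 3}
fruits = ['apple', 'banana', 'apple', 'cherry', 'banana', 'date']

result = [fruit_with_count(fruit, lookup) for fruit in fruits]
print(result)
```

In a Spark job the same helper could be called as fruits.map(lambda fruit: fruit_with_count(fruit, broadcastVar.value)), which keeps the lookup logic testable in isolation.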
Important Notes

Broadcast variables are read-only. You cannot change their value after broadcasting.

Use broadcast variables for small data that fits in memory on each worker.

Broadcasting large data can cause memory issues on workers.
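
One way to guard against this is to estimate the size of a value before broadcasting it. The sketch below measures the pickled size as a rough proxy. The helper names and the 10 MB limit are assumptions chosen for illustration, not Spark settings, and the in-memory size on a worker can be larger than the pickled size.

```python
import pickle

# Hypothetical threshold for illustration; choose one that suits your cluster.
MAX_BROADCAST_BYTES = 10 * 1024 * 1024


def pickled_size(value):
    """Return the number of bytes the value occupies once pickled."""
    return len(pickle.dumps(value))


def is_small_enough(value, limit=MAX_BROADCAST_BYTES):
    """Report whether the pickled value fits under the given byte limit."""
    return pickled_size(value) <= limit


lookup = {'apple': 1, 'banana': 2, 'cherry': 3}
if is_small_enough(lookup):
    print(f"OK to broadcast: about {pickled_size(lookup)} bytes when pickled")
else:
    print("Too large for the assumed limit; reconsider broadcasting this value")
```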

Summary

Broadcast variables share data efficiently across Spark workers.

They reduce network traffic by sending the data to each executor once instead of with every task.

Use them for small, read-only data needed by many tasks.