Broadcast variables help share data efficiently across all worker nodes in a Spark cluster without sending it multiple times.
Broadcast variables in Apache Spark
Introduction
Broadcast variables are useful in the following situations:
When you have a small lookup table that all tasks need to access.
When you want to avoid sending the same data repeatedly to each worker.
When you want to improve performance by reducing network traffic.
When you need to share read-only data across multiple stages of a job.
Syntax
broadcastVar = sc.broadcast(value)
sc is the SparkContext.
value is the data you want to share (like a list or dictionary).
Examples
This creates a broadcast variable from a dictionary for fast access on all workers.
lookup = {'a': 1, 'b': 2, 'c': 3}
broadcastVar = sc.broadcast(lookup)

Broadcast a list of numbers to use in your Spark tasks.
numbers = [10, 20, 30]
broadcastVar = sc.broadcast(numbers)
Sample Program
This program broadcasts a dictionary of fruit counts. It then maps over a list of fruits, replacing each fruit with its count from the broadcast variable. If the fruit is not in the dictionary, it returns 0.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('BroadcastExample').getOrCreate()
sc = spark.sparkContext

# Data to broadcast
lookup = {'apple': 1, 'banana': 2, 'cherry': 3}

# Create broadcast variable
broadcastVar = sc.broadcast(lookup)

# Sample RDD
fruits = sc.parallelize(['apple', 'banana', 'apple', 'cherry', 'banana', 'date'])

# Use broadcast variable in map
result = fruits.map(lambda fruit: (fruit, broadcastVar.value.get(fruit, 0))).collect()
print(result)

spark.stop()
Output
[('apple', 1), ('banana', 2), ('apple', 1), ('cherry', 3), ('banana', 2), ('date', 0)]
Important Notes
Broadcast variables are read-only. You cannot change their value after broadcasting.
Use broadcast variables for small data that fits in memory on each worker.
Broadcasting large data can cause memory issues on workers.
Summary
Broadcast variables share data efficiently across Spark workers.
They reduce network traffic by sending data only once.
Use them for small, read-only data needed by many tasks.