
Broadcast variables in Apache Spark - Step-by-Step Execution

Concept Flow - Broadcast variables
Create a large read-only variable on the driver
Broadcast the variable to all worker nodes
Workers receive the broadcast variable
Workers use the broadcast variable in their tasks
Tasks execute efficiently without receiving the full data each time
Job completes
Broadcast variables are created once and sent to all worker nodes to efficiently share large read-only data during distributed tasks.
Execution Sample
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create the read-only data on the driver and broadcast it once to all workers.
data = [1, 2, 3, 4]
bc_var = spark.sparkContext.broadcast(data)

# Each task reads the broadcast copy via bc_var.value; sum([1, 2, 3, 4]) == 10.
result = spark.sparkContext.parallelize([10, 20]) \
    .map(lambda x: x + sum(bc_var.value)) \
    .collect()
# result == [20, 30]
This code broadcasts a list to all workers and uses it in a map operation to add the sum of the list to each element.
Execution Table
Step | Action | Broadcast Variable State | Task Input | Task Output
1 | Create list data | [1, 2, 3, 4] | N/A | N/A
2 | Broadcast data to workers | [1, 2, 3, 4] broadcasted | N/A | N/A
3 | Parallelize input [10, 20] | [1, 2, 3, 4] broadcasted | [10, 20] | N/A
4 | Map: for 10, add sum(bc_var.value) = 10 | [1, 2, 3, 4] broadcasted | 10 | 20
5 | Map: for 20, add sum(bc_var.value) = 10 | [1, 2, 3, 4] broadcasted | 20 | 30
6 | Collect results | [1, 2, 3, 4] broadcasted | [10, 20] | [20, 30]
💡 All tasks completed using the broadcast variable without re-sending the data.
Variable Tracker
Variable | Start | After Step 2 | After Step 6
data | N/A | [1, 2, 3, 4] | [1, 2, 3, 4]
bc_var.value | N/A | [1, 2, 3, 4] | [1, 2, 3, 4]
result | N/A | N/A | [20, 30]
Key Moments - 2 Insights
Why do we broadcast the variable instead of sending it with each task?
Broadcasting sends the variable once to all workers, so tasks reuse it without repeated data transfer, as shown in steps 2 and 4-5 of the execution table.
Can the broadcast variable be changed after broadcasting?
No, broadcast variables are read-only after creation. The execution table shows the broadcast variable state stays the same from step 2 to 6.
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the output of the map operation for input 10 at step 4?
A. 30
B. 10
C. 20
D. 40
💡 Hint
Check the 'Task Output' column at step 4 in the execution table.
At which step is the broadcast variable first sent to the workers?
A. Step 2
B. Step 1
C. Step 3
D. Step 4
💡 Hint
Look at the 'Broadcast Variable State' column to see when it changes to 'broadcasted'.
If the broadcast variable were not used, how would the tasks behave?
A. Tasks would run faster because no broadcast is needed.
B. Each task would receive the full data separately, increasing network traffic.
C. Tasks would fail because data is missing.
D. Tasks would ignore the data and produce the same output.
💡 Hint
Recall the purpose of broadcast variables shown in the concept flow and key moments.
Concept Snapshot
Broadcast variables in Spark:
- Created once and sent to all workers
- Used to share large read-only data efficiently
- Avoid repeated data transfer in tasks
- Accessed via bc_var.value
- Immutable after broadcasting
Full Transcript
Broadcast variables are a way to share large read-only data efficiently in Apache Spark. First, you create the variable on the driver. Then Spark sends it once to all worker nodes. Workers use this broadcasted data in their tasks without needing to receive it repeatedly. This saves network traffic and speeds up jobs. The broadcast variable cannot be changed after sending. In the example, a list is broadcasted and each task adds the sum of the list to its input. The execution table shows each step from creating the data, broadcasting, running tasks, to collecting results. This helps beginners see how broadcast variables improve distributed computing.