
Broadcast variables in Apache Spark - Step-by-Step Execution

Concept Flow - Broadcast variables
Create a large read-only variable on the driver
Broadcast the variable to all worker nodes
Workers receive the broadcast variable
Workers use the broadcast variable in their tasks
Tasks execute efficiently without receiving the full data each time
Job completes
Broadcast variables are created once and sent to all worker nodes to efficiently share large read-only data during distributed tasks.
Execution Sample
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create the read-only data on the driver and broadcast it once to all workers.
data = [1, 2, 3, 4]
bc_var = spark.sparkContext.broadcast(data)

# Each task reads the broadcast copy via bc_var.value; sum([1, 2, 3, 4]) == 10.
result = spark.sparkContext.parallelize([10, 20]) \
    .map(lambda x: x + sum(bc_var.value)) \
    .collect()
# result == [20, 30]
This code broadcasts a list to all workers and uses it in a map operation to add the sum of the list to each element.
Execution Table
Step | Action | Broadcast Variable State | Task Input | Task Output
1 | Create list data | [1, 2, 3, 4] | N/A | N/A
2 | Broadcast data to workers | [1, 2, 3, 4] broadcasted | N/A | N/A
3 | Parallelize input [10, 20] | [1, 2, 3, 4] broadcasted | [10, 20] | N/A
4 | Map: for 10, add sum(bc_var.value) = 10 | [1, 2, 3, 4] broadcasted | 10 | 20
5 | Map: for 20, add sum(bc_var.value) = 10 | [1, 2, 3, 4] broadcasted | 20 | 30
6 | Collect results | [1, 2, 3, 4] broadcasted | [10, 20] | [20, 30]
💡 All tasks completed using the broadcast variable without re-sending the data.
Variable Tracker
Variable | Start | After Step 2 | After Step 6
data | N/A | [1, 2, 3, 4] | [1, 2, 3, 4]
bc_var.value | N/A | [1, 2, 3, 4] | [1, 2, 3, 4]
result | N/A | N/A | [20, 30]
Key Moments - 2 Insights
Why do we broadcast the variable instead of sending it with each task?
Broadcasting sends the variable once to all workers, so tasks reuse it without repeated data transfer, as shown in steps 2 and 4-5 of the execution table.
Can the broadcast variable be changed after broadcasting?
No, broadcast variables are read-only after creation. The execution table shows the broadcast variable state stays the same from step 2 to 6.
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the output of the map operation for input 10 at step 4?
A. 30
B. 10
C. 20
D. 40
💡 Hint
Check the 'Task Output' column at step 4 in the execution table.
At which step is the broadcast variable first sent to the workers?
A. Step 2
B. Step 1
C. Step 3
D. Step 4
💡 Hint
Look at the 'Broadcast Variable State' column to see when it changes to 'broadcasted'.
If the broadcast variable were not used, how would the tasks behave?
A. Tasks would run faster because no broadcast is needed.
B. Each task would receive the full data separately, increasing network traffic.
C. Tasks would fail because data is missing.
D. Tasks would ignore the data and produce the same output.
💡 Hint
Recall the purpose of broadcast variables shown in the concept flow and key moments.
Concept Snapshot
Broadcast variables in Spark:
- Created once and sent to all workers
- Used to share large read-only data efficiently
- Avoid repeated data transfer in tasks
- Accessed via bc_var.value
- Immutable after broadcasting
Full Transcript
Broadcast variables are a way to share large read-only data efficiently in Apache Spark. First, you create the variable on the driver. Then Spark sends it once to all worker nodes. Workers use this broadcasted data in their tasks without needing to receive it repeatedly. This saves network traffic and speeds up jobs. The broadcast variable cannot be changed after sending. In the example, a list is broadcasted and each task adds the sum of the list to its input. The execution table shows each step from creating the data, broadcasting, running tasks, to collecting results. This helps beginners see how broadcast variables improve distributed computing.