Apache Spark · ~10 mins

Accumulator variables in Apache Spark - Step-by-Step Execution

Concept Flow - Accumulator variables
Start Spark Context
Create Accumulator with initial value 0
Run distributed tasks
Each task adds to accumulator
Accumulator updates safely across nodes
Collect final accumulator value
Use accumulator result for analysis
Accumulator variables start at zero and safely collect sums or counts from many distributed tasks, then provide a final combined result.
Execution Sample
Apache Spark
from pyspark import SparkContext

sc = SparkContext("local", "AccumulatorExample")
acc = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])
rdd.foreach(lambda x: acc.add(x))
print(acc.value)  # 10
This code creates an accumulator, adds each RDD element to it in parallel, then prints the total sum.
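The same merge semantics can be mimicked without a cluster. Below is a pure-Python sketch (no Spark required; `run_task` and the partition layout are illustrative, not Spark API) in which each simulated task keeps a local partial sum, and the driver combines them only after every task has finished.

```python
# Pure-Python sketch of accumulator semantics (illustrative, not Spark API).
# Each "task" accumulates into its own local copy; the driver only sees
# the combined value after every task has finished.

def run_task(elements):
    local = 0                 # task-local copy of the accumulator
    for x in elements:
        local += x            # corresponds to acc.add(x) inside the task
    return local              # shipped back to the driver on completion

partitions = [[1], [2], [3], [4]]          # one element per task, as in this example
local_updates = [run_task(p) for p in partitions]
driver_value = sum(local_updates)          # driver merges after all tasks finish

print(driver_value)  # 10
```

This mirrors the Spark sample: no single task ever holds the total; only the driver's final merge does.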
Execution Table
Step | Action | Accumulator Value (Local Task) | Accumulator Value (Driver)
1 | Create accumulator with initial value | 0 | 0
2 | Start task 1: process element 1, add 1 | 1 | 0
3 | Start task 2: process element 2, add 2 | 2 | 0
4 | Start task 3: process element 3, add 3 | 3 | 0
5 | Start task 4: process element 4, add 4 | 4 | 0
6 | Tasks complete, driver collects updates | N/A | 10
7 | Print accumulator value | N/A | 10
💡 All tasks finished, accumulator value collected at driver as 10 (sum of 1+2+3+4)
Variable Tracker
Variable | Start | After Task 1 | After Task 2 | After Task 3 | After Task 4 | Final
accumulator | 0 | 1 | 3 | 6 | 10 | 10
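The tracker's running totals (1, 3, 6, 10) are a merged view of the accumulator after each task's update lands. They can be reproduced with `itertools.accumulate` in plain Python (an aside for checking the table, not Spark code):

```python
from itertools import accumulate

elements = [1, 2, 3, 4]
running = list(accumulate(elements))   # merged accumulator view after each task
print(running)  # [1, 3, 6, 10]
```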
Key Moments - 2 Insights
Why does the accumulator value on the driver stay 0 until all tasks finish?
Because each task updates its own local copy of the accumulator. The driver only sees the combined result after all tasks complete: in the Execution Table, the driver column stays 0 in rows 2-5 and only becomes 10 in row 6.
Can accumulator values be read inside tasks during execution?
No, accumulator values are only reliably read on the driver after tasks finish. Inside tasks, the accumulator is write-only, preventing inconsistent reads.
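One way to picture "write-only inside tasks" is a wrapper that accepts writes anywhere but rejects reads while a task is running. The class below is a hypothetical plain-Python sketch, not Spark's actual implementation:

```python
# Hypothetical sketch of a write-only accumulator (not Spark's real classes).

class WriteOnlyAccumulator:
    def __init__(self, initial=0):
        self._value = initial
        self._in_task = False      # True while a simulated task is running

    def add(self, x):
        self._value += x           # writes are always allowed

    @property
    def value(self):
        if self._in_task:          # reads inside tasks are rejected
            raise RuntimeError("accumulator value can only be read on the driver")
        return self._value

acc = WriteOnlyAccumulator(0)
acc._in_task = True                # simulate entering a task
acc.add(5)                         # writing inside the task succeeds
try:
    _ = acc.value                  # reading inside the task fails
except RuntimeError as e:
    task_error = str(e)
acc._in_task = False               # back on the driver
driver_read = acc.value            # reading on the driver succeeds
```

PySpark enforces the same rule: accessing the value inside a task raises an error, so tasks can only call `add`.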
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the accumulator value on the driver after task 3 completes but before task 4 completes?
A. 3
B. 0
C. 6
D. N/A
💡 Hint
Check the 'Accumulator Value (Driver)' column in rows 2 to 5; driver updates only after all tasks finish.
At which step does the driver finally receive the total accumulator value?
A. Step 4
B. Step 5
C. Step 6
D. Step 7
💡 Hint
Look at the 'Action' column for when driver collects updates (row 6).
If the RDD had elements [1, 2, 3] instead, what would the final accumulator value be?
A. 6
B. 10
C. 0
D. 9
💡 Hint
Sum the elements in the RDD and compare with the final accumulator value in the Variable Tracker.
Concept Snapshot
Accumulator variables in Spark:
- Created with sc.accumulator(initial_value)
- Used to sum or count across distributed tasks
- Tasks add to local accumulator copies
- Driver collects final combined value after tasks finish
- Accumulators are write-only inside tasks
- Useful for debugging and metrics in distributed jobs
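The "debugging and metrics" bullet is typically realized as a counter that tracks bad records while the main job does its work. Here is a pure-Python sketch of that pattern (illustrative; in Spark, `bad_records` would be `sc.accumulator(0)` and the loop body would run inside tasks):

```python
# Sketch: a counter accumulator collects a metric while summing valid records.
records = ["1", "2", "oops", "3", ""]

bad_records = 0        # would be sc.accumulator(0) in Spark
total = 0

for r in records:      # in Spark, this logic would run inside distributed tasks
    try:
        total += int(r)
    except ValueError:
        bad_records += 1   # metric updated as a side effect of processing

print(total, bad_records)  # 6 2
```

After the job finishes, the driver reads both totals: the computation result and the metric, without a separate pass over the data.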
Full Transcript
Accumulator variables in Apache Spark start at an initial value, usually zero. When you run tasks on distributed data, each task adds to its local copy of the accumulator. These local updates do not immediately change the driver's accumulator value. After all tasks complete, Spark safely combines all local accumulator values and updates the driver's accumulator. This final value can then be read and used for analysis or debugging. Accumulators are write-only inside tasks to avoid inconsistent reads. This example showed summing numbers in an RDD using an accumulator, resulting in the total sum collected at the driver after all tasks finish.