Apache Spark · ~10 mins

Accumulator variables in Apache Spark - Step-by-Step Execution

Concept Flow - Accumulator variables
Start Spark Context
Create Accumulator with initial value 0
Run distributed tasks
Each task adds to accumulator
Accumulator updates safely across nodes
Collect final accumulator value
Use accumulator result for analysis
Accumulator variables start at zero and safely collect sums or counts from many distributed tasks, then provide a final combined result.
Execution Sample
Apache Spark
from pyspark import SparkContext

sc = SparkContext("local", "AccumulatorExample")
acc = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])
rdd.foreach(lambda x: acc.add(x))
print(acc.value)  # 10
This code creates an accumulator, adds each RDD element to it in parallel, then prints the total sum.
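The same merge semantics can be mimicked without a cluster. Below is a pure-Python sketch (no Spark required; `run_task` and the partition layout are illustrative, not Spark API) in which each simulated task keeps a local partial sum, and the driver combines them only after every task has finished.

```python
# Pure-Python sketch of accumulator semantics (illustrative, not Spark API).
# Each "task" accumulates into its own local copy; the driver only sees
# the combined value after every task has finished.

def run_task(elements):
    local = 0                 # task-local copy of the accumulator
    for x in elements:
        local += x            # corresponds to acc.add(x) inside the task
    return local              # shipped back to the driver on completion

partitions = [[1], [2], [3], [4]]          # one element per task, as in this example
local_updates = [run_task(p) for p in partitions]
driver_value = sum(local_updates)          # driver merges after all tasks finish

print(driver_value)  # 10
```

This mirrors the Spark sample: no single task ever holds the total; only the driver's final merge does.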
Execution Table
Step | Action | Accumulator Value (Local Task) | Accumulator Value (Driver)
1 | Create accumulator with initial value | 0 | 0
2 | Start task 1: process element 1, add 1 | 1 | 0
3 | Start task 2: process element 2, add 2 | 2 | 0
4 | Start task 3: process element 3, add 3 | 3 | 0
5 | Start task 4: process element 4, add 4 | 4 | 0
6 | Tasks complete, driver collects updates | N/A | 10
7 | Print accumulator value | N/A | 10
💡 All tasks finished, accumulator value collected at driver as 10 (sum of 1+2+3+4)
Variable Tracker
Variable | Start | After Task 1 | After Task 2 | After Task 3 | After Task 4 | Final
accumulator | 0 | 1 | 3 | 6 | 10 | 10
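The tracker's running totals (1, 3, 6, 10) are a merged view of the accumulator after each task's update lands. They can be reproduced with `itertools.accumulate` in plain Python (an aside for checking the table, not Spark code):

```python
from itertools import accumulate

elements = [1, 2, 3, 4]
running = list(accumulate(elements))   # merged accumulator view after each task
print(running)  # [1, 3, 6, 10]
```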
Key Moments - 2 Insights
Why does the accumulator value on the driver stay 0 until all tasks finish?
Because each task updates its own local copy of the accumulator. The driver only sees the combined result after all tasks complete: in the Execution Table, the driver column stays 0 in rows 2-5 and only becomes 10 in row 6.
Can accumulator values be read inside tasks during execution?
No, accumulator values are only reliably read on the driver after tasks finish. Inside tasks, the accumulator is write-only, preventing inconsistent reads.
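One way to picture "write-only inside tasks" is a wrapper that accepts writes anywhere but rejects reads while a task is running. The class below is a hypothetical plain-Python sketch, not Spark's actual implementation:

```python
# Hypothetical sketch of a write-only accumulator (not Spark's real classes).

class WriteOnlyAccumulator:
    def __init__(self, initial=0):
        self._value = initial
        self._in_task = False      # True while a simulated task is running

    def add(self, x):
        self._value += x           # writes are always allowed

    @property
    def value(self):
        if self._in_task:          # reads inside tasks are rejected
            raise RuntimeError("accumulator value can only be read on the driver")
        return self._value

acc = WriteOnlyAccumulator(0)
acc._in_task = True                # simulate entering a task
acc.add(5)                         # writing inside the task succeeds
try:
    _ = acc.value                  # reading inside the task fails
except RuntimeError as e:
    task_error = str(e)
acc._in_task = False               # back on the driver
driver_read = acc.value            # reading on the driver succeeds
```

PySpark enforces the same rule: accessing the value inside a task raises an error, so tasks can only call `add`.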
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the accumulator value on the driver after task 3 completes but before task 4 completes?
A. 3
B. 0
C. 6
D. N/A
💡 Hint
Check the 'Accumulator Value (Driver)' column in rows 2 to 5; driver updates only after all tasks finish.
At which step does the driver finally receive the total accumulator value?
A. Step 4
B. Step 5
C. Step 6
D. Step 7
💡 Hint
Look at the 'Action' column for when driver collects updates (row 6).
If the RDD had elements [1, 2, 3] instead, what would the final accumulator value be?
A. 6
B. 10
C. 0
D. 9
💡 Hint
Sum the elements in the RDD and compare with the final accumulator value in the Variable Tracker.
Concept Snapshot
Accumulator variables in Spark:
- Created with sc.accumulator(initial_value)
- Used to sum or count across distributed tasks
- Tasks add to local accumulator copies
- Driver collects final combined value after tasks finish
- Accumulators are write-only inside tasks
- Useful for debugging and metrics in distributed jobs
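The "debugging and metrics" bullet is typically realized as a counter that tracks bad records while the main job does its work. Here is a pure-Python sketch of that pattern (illustrative; in Spark, `bad_records` would be `sc.accumulator(0)` and the loop body would run inside tasks):

```python
# Sketch: a counter accumulator collects a metric while summing valid records.
records = ["1", "2", "oops", "3", ""]

bad_records = 0        # would be sc.accumulator(0) in Spark
total = 0

for r in records:      # in Spark, this logic would run inside distributed tasks
    try:
        total += int(r)
    except ValueError:
        bad_records += 1   # metric updated as a side effect of processing

print(total, bad_records)  # 6 2
```

After the job finishes, the driver reads both totals: the computation result and the metric, without a separate pass over the data.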
Full Transcript
Accumulator variables in Apache Spark start at an initial value, usually zero. When you run tasks on distributed data, each task adds to its local copy of the accumulator. These local updates do not immediately change the driver's accumulator value. After all tasks complete, Spark safely combines all local accumulator values and updates the driver's accumulator. This final value can then be read and used for analysis or debugging. Accumulators are write-only inside tasks to avoid inconsistent reads. This example showed summing numbers in an RDD using an accumulator, resulting in the total sum collected at the driver after all tasks finish.