
Accumulator variables in Apache Spark

Introduction

Accumulator variables let you count or sum values across the many executors working together on a distributed Spark job. Typical uses include:

Counting how many errors occur while processing big data.
Summing values from many partitions of a distributed dataset.
Tracking progress or statistics during a Spark job.
Debugging by counting how often a condition occurs in the data.
Syntax
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
accum = spark.sparkContext.accumulator(0)

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

rdd.foreach(lambda x: accum.add(x))

print(accum.value)

Use spark.sparkContext.accumulator(initial_value) to create an accumulator.

Use accum.add(value) inside actions like foreach to update it.

Examples
Counts how many items are in the RDD by adding 1 for each element.
Apache Spark
accum = spark.sparkContext.accumulator(0)
rdd.foreach(lambda x: accum.add(1))
print(accum.value)
Sums all the numbers in the RDD.
Apache Spark
accum = spark.sparkContext.accumulator(0)
rdd.foreach(lambda x: accum.add(x))
print(accum.value)
Counts how many even numbers are in the RDD.
Apache Spark
accum = spark.sparkContext.accumulator(0)
rdd.filter(lambda x: x % 2 == 0).foreach(lambda x: accum.add(1))
print(accum.value)
Sample Program

This program counts how many times the word 'error' appears in the data using an accumulator.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('AccumulatorExample').getOrCreate()

# Create an accumulator starting at 0
error_count = spark.sparkContext.accumulator(0)

# Sample data with some 'error' strings
data = ['ok', 'error', 'ok', 'error', 'error', 'ok']
rdd = spark.sparkContext.parallelize(data)

# Increase accumulator for each 'error'
rdd.foreach(lambda x: error_count.add(1) if x == 'error' else None)

print(f"Number of errors: {error_count.value}")

spark.stop()
Output
Number of errors: 3
Important Notes

Accumulator updates are guaranteed to be applied exactly once only inside actions like foreach; updates made in transformations such as map can be re-applied whenever a stage is recomputed.

Accumulator values are only reliable after the action completes.

They are write-only in tasks and readable only on the driver program.

Summary

Accumulator variables help count or sum values across many computers.

Create them with sparkContext.accumulator() and update inside actions.

Use them to track progress, errors, or statistics during Spark jobs.