
Accumulator variables in Apache Spark

Introduction

Accumulator variables let you count or sum values across the many executors working together on a distributed Spark job. Typical uses include:

Counting how many errors occur while processing big data.
Summing values from many partitions of a distributed dataset.
Tracking progress or statistics during a Spark job.
Debugging by counting how often a condition occurs in the data.
Syntax
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
accum = spark.sparkContext.accumulator(0)

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

rdd.foreach(lambda x: accum.add(x))

print(accum.value)

Use spark.sparkContext.accumulator(initial_value) to create an accumulator.

Use accum.add(value) inside actions like foreach to update it.

Examples
Counts how many items are in the RDD by adding 1 for each element.
Apache Spark
accum = spark.sparkContext.accumulator(0)
rdd.foreach(lambda x: accum.add(1))
print(accum.value)
Sums all the numbers in the RDD.
Apache Spark
accum = spark.sparkContext.accumulator(0)
rdd.foreach(lambda x: accum.add(x))
print(accum.value)
Counts how many even numbers are in the RDD.
Apache Spark
accum = spark.sparkContext.accumulator(0)
rdd.filter(lambda x: x % 2 == 0).foreach(lambda x: accum.add(1))
print(accum.value)
Sample Program

This program counts how many times the word 'error' appears in the data using an accumulator.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('AccumulatorExample').getOrCreate()

# Create an accumulator starting at 0
error_count = spark.sparkContext.accumulator(0)

# Sample data with some 'error' strings
data = ['ok', 'error', 'ok', 'error', 'error', 'ok']
rdd = spark.sparkContext.parallelize(data)

# Increase accumulator for each 'error'
rdd.foreach(lambda x: error_count.add(1) if x == 'error' else None)

print(f"Number of errors: {error_count.value}")

spark.stop()
Output
Number of errors: 3
Important Notes

Accumulator updates are guaranteed to be applied exactly once only inside actions like foreach; updates made in transformations such as map can be re-applied whenever a stage is recomputed.

Accumulator values are only reliable after the action completes.

They are write-only in tasks and readable only on the driver program.

Summary

Accumulator variables help count or sum values across many computers.

Create them with sparkContext.accumulator() and update inside actions.

Use them to track progress, errors, or statistics during Spark jobs.