
Reduce and aggregate actions in Apache Spark

Introduction

Reduce and aggregate actions help you combine many data points into a single summary value. This makes it easier to understand big data by getting totals, averages, or other summaries.

You want to find the total sales from a list of daily sales numbers.
You need to calculate the average temperature from many sensor readings.
You want to count how many times each word appears in a large text.
You want to find the maximum or minimum value in a dataset.
You want to combine all data points into one result for reporting.
Syntax
Apache Spark
rdd.reduce(func)
rdd.aggregate(zeroValue, seqOp, combOp)

reduce combines all elements using a binary function that takes two elements and returns one. The function should be commutative and associative, because Spark may apply it in any order across partitions.

aggregate is more flexible: it takes a starting value (zeroValue), a function (seqOp) that folds each element into an accumulator within a partition, and a function (combOp) that merges accumulators across partitions. The accumulator can even have a different type than the elements.

Examples
This adds all numbers together using reduce.
Apache Spark
numbers = sc.parallelize([1, 2, 3, 4])
total = numbers.reduce(lambda a, b: a + b)  # avoids shadowing the built-in sum
This also sums numbers but uses aggregate with a zero start and two functions.
Apache Spark
numbers = sc.parallelize([1, 2, 3, 4])
zero = 0
seqOp = (lambda acc, x: acc + x)
combOp = (lambda acc1, acc2: acc1 + acc2)
total = numbers.aggregate(zero, seqOp, combOp)  # avoids shadowing the built-in sum
This counts occurrences of each word using aggregateByKey after mapping to (word, 1) pairs.
Apache Spark
words = sc.parallelize(['apple', 'banana', 'apple'])
word_pairs = words.map(lambda w: (w, 1))
count = word_pairs.aggregateByKey(0, lambda acc, v: acc + v, lambda acc1, acc2: acc1 + acc2).collect()
Sample Program

This program creates a list of numbers as an RDD. It then finds the sum of all numbers using both reduce and aggregate. Both methods give the same result.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('ReduceAggregateExample').getOrCreate()
sc = spark.sparkContext

# Create an RDD of numbers
numbers = sc.parallelize([10, 20, 30, 40, 50])

# Use reduce to find the sum
sum_reduce = numbers.reduce(lambda a, b: a + b)

# Use aggregate to find the sum with zero start
zero = 0
seqOp = lambda acc, x: acc + x
combOp = lambda acc1, acc2: acc1 + acc2
sum_aggregate = numbers.aggregate(zero, seqOp, combOp)

print(f"Sum using reduce: {sum_reduce}")
print(f"Sum using aggregate: {sum_aggregate}")

spark.stop()
Output
Sum using reduce: 150
Sum using aggregate: 150
Important Notes

reduce requires a function that combines two elements into one.

aggregate is useful when you want to start from a specific value or do more complex combining.

Both actions trigger computation in Spark and return results to the driver program.

Summary

Reduce and aggregate actions combine many data points into one summary value.

Use reduce for simple combining with one function.

Use aggregate for more control with a starting value and two functions.