Reduce and aggregate actions help you combine many data points into a single summary value. This makes it easier to understand big data by getting totals, averages, or other summaries.
Reduce and aggregate actions in Apache Spark
rdd.reduce(func) rdd.aggregate(zeroValue, seqOp, combOp)
reduce combines all elements using a function that takes two inputs and returns one output.
aggregate is more flexible: it uses a starting value, a function to combine data within partitions, and another to combine results across partitions.
reduce.numbers = sc.parallelize([1, 2, 3, 4]) sum = numbers.reduce(lambda a, b: a + b)
aggregate with a zero start and two functions.numbers = sc.parallelize([1, 2, 3, 4]) zero = 0 seqOp = (lambda acc, x: acc + x) combOp = (lambda acc1, acc2: acc1 + acc2) sum = numbers.aggregate(zero, seqOp, combOp)
aggregateByKey after mapping to (word, 1) pairs.words = sc.parallelize(['apple', 'banana', 'apple']) word_pairs = words.map(lambda w: (w, 1)) count = word_pairs.aggregateByKey(0, lambda acc, v: acc + v, lambda acc1, acc2: acc1 + acc2).collect()
This program creates a list of numbers as an RDD. It then finds the sum of all numbers using both reduce and aggregate. Both methods give the same result.
from pyspark.sql import SparkSession spark = SparkSession.builder.master('local[*]').appName('ReduceAggregateExample').getOrCreate() sc = spark.sparkContext # Create an RDD of numbers numbers = sc.parallelize([10, 20, 30, 40, 50]) # Use reduce to find the sum sum_reduce = numbers.reduce(lambda a, b: a + b) # Use aggregate to find the sum with zero start zero = 0 seqOp = lambda acc, x: acc + x combOp = lambda acc1, acc2: acc1 + acc2 sum_aggregate = numbers.aggregate(zero, seqOp, combOp) print(f"Sum using reduce: {sum_reduce}") print(f"Sum using aggregate: {sum_aggregate}") spark.stop()
reduce requires a function that combines two elements into one.
aggregate is useful when you want to start from a specific value or do more complex combining.
Both actions trigger computation in Spark and return results to the driver program.
Reduce and aggregate actions combine many data points into one summary value.
Use reduce for simple combining with one function.
Use aggregate for more control with a starting value and two functions.