
Windowed aggregations in Apache Spark - Step-by-Step Execution

Concept Flow - Windowed aggregations
1. Start with a DataFrame
2. Define the window spec
3. Apply an aggregation over the window
4. Add the result as a new column
5. View the resulting DataFrame
Windowed aggregations apply calculations over a sliding group of rows defined by a window, adding results as new columns.
Execution Sample
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as spark_sum  # alias to avoid shadowing the builtin sum

spark = SparkSession.builder.getOrCreate()
data = [(1, 'A', 10), (2, 'A', 20), (3, 'A', 30), (4, 'B', 40), (5, 'B', 50)]
df = spark.createDataFrame(data, ['id', 'group', 'value'])

# Frame: from the previous row (-1) through the current row (0),
# within each 'group' partition, ordered by 'id'
windowSpec = Window.partitionBy('group').orderBy('id').rowsBetween(-1, 0)
df = df.withColumn('rolling_sum', spark_sum('value').over(windowSpec))
df.show()
This code calculates a rolling sum of 'value' over the current and previous row within each 'group'.
Execution Table
| Step | Row (id, group, value) | Window Rows Included | Aggregation (sum) | New Column 'rolling_sum' |
|------|------------------------|----------------------|--------------------|---------------------------|
| 1    | (1, A, 10)             | rows with id 1       | 10                 | 10                        |
| 2    | (2, A, 20)             | rows with id 1, 2    | 10 + 20 = 30       | 30                        |
| 3    | (3, A, 30)             | rows with id 2, 3    | 20 + 30 = 50       | 50                        |
| 4    | (4, B, 40)             | rows with id 4       | 40                 | 40                        |
| 5    | (5, B, 50)             | rows with id 4, 5    | 40 + 50 = 90       | 90                        |
| 6    | End of DataFrame       | -                    | -                  | -                         |
💡 All rows processed; rolling sums computed for each row within their group windows.
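The frame logic traced in the table above can be sketched in plain Python, with no Spark required. Here `rolling_sum` is a hypothetical helper written only to mirror the semantics of partitionBy('group').orderBy('id').rowsBetween(-1, 0):

```python
from itertools import groupby

data = [(1, 'A', 10), (2, 'A', 20), (3, 'A', 30), (4, 'B', 40), (5, 'B', 50)]

def rolling_sum(rows, start=-1, end=0):
    # Mimics Window.partitionBy('group').orderBy('id').rowsBetween(start, end)
    out = []
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))  # sort by (group, id)
    for _, grp in groupby(ordered, key=lambda r: r[1]):  # partition by group
        grp = list(grp)
        for i, row in enumerate(grp):
            lo = max(0, i + start)   # one row back, clipped at the partition start
            hi = i + end             # up to the current row
            frame = grp[lo:hi + 1]
            out.append(row + (sum(v for _, _, v in frame),))
    return out

for row in rolling_sum(data):
    print(row)  # (id, group, value, rolling_sum), matching the execution table
```

Note how `max(0, i + start)` clips the frame at the partition boundary, which is why the first row of each group sums only itself.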
Variable Tracker
| Variable    | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 | After Step 5 | Final |
|-------------|-------|--------------|--------------|--------------|--------------|--------------|-------|
| rolling_sum | N/A   | 10           | 30           | 50           | 40           | 90           | 90    |
Key Moments - 2 Insights
Why does the rolling sum for id=3 only include values from id=2 and id=3, not id=1?
Because the window is defined with rowsBetween(-1, 0), it includes only the current row and one preceding row, so for id=3 the window covers the rows for id=2 and id=3 (see Execution Table, step 3).
Why are rows partitioned by 'group' before applying the window aggregation?
Partitioning by 'group' means the window aggregation only sums values within the same group, so rows from group 'A' and group 'B' are handled separately (see Execution Table, steps 1-3 vs 4-5).
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table at Step 2. What is the rolling_sum value for row with id=2?
A) 30
B) 10
C) 20
D) 50
💡 Hint
Check the 'New Column rolling_sum' value in the Execution Table, step 2.
At which step does the rolling_sum first include two rows in its window?
A) Step 1
B) Step 4
C) Step 2
D) Step 5
💡 Hint
Look at the 'Window Rows Included' column in the Execution Table to see when two rows are first included.
If the window was changed to rowsBetween(-2, 0), what would be the rolling_sum for id=3?
A) 50
B) 60
C) 30
D) 10
💡 Hint
With rowsBetween(-2, 0), the window includes current and two previous rows; sum values for id=1,2,3.
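The widened frame from the last question can be simulated directly in plain Python. This is a sketch, not Spark code; `frame_sum` is a hypothetical helper for illustration:

```python
# Simulate rowsBetween(-2, 0) for partition 'A' (ids 1..3) -- no Spark needed.
values = {1: 10, 2: 20, 3: 30}        # id -> value within partition 'A'
ids = sorted(values)

def frame_sum(target_id, preceding=2):
    i = ids.index(target_id)
    lo = max(0, i - preceding)        # clip the frame at the partition boundary
    return sum(values[j] for j in ids[lo:i + 1])

print(frame_sum(3))  # 10 + 20 + 30 = 60
```

For id=3 the frame now reaches back two rows, so all three group-'A' values are summed.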
Concept Snapshot
Windowed aggregations in Spark:
- Define a window with partitionBy and orderBy
- Use rowsBetween to set frame (e.g., current and previous rows)
- Apply aggregation function over window (sum, avg, etc.)
- Result added as new column
- Useful for rolling calculations within groups
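The bullets above can be condensed into one small sketch. `windowed` below is a hypothetical helper, not part of the Spark API; it only mirrors the partitionBy / orderBy / rowsBetween / aggregate pipeline in pure Python:

```python
from statistics import mean

def windowed(rows, partition, order, frame, agg):
    """Apply agg over a rowsBetween-style frame within each partition.

    rows:      list of dicts
    partition: key to group by   (like partitionBy)
    order:     key to sort by    (like orderBy)
    frame:     (start, end) row offsets relative to the current row
    agg:       function applied to the list of framed values
    """
    start, end = frame
    groups = {}
    for r in sorted(rows, key=lambda r: (r[partition], r[order])):
        groups.setdefault(r[partition], []).append(r)
    result = []
    for grp in groups.values():
        for i, r in enumerate(grp):
            lo = max(0, i + start)                    # clip at partition start
            window = grp[lo:i + end + 1]
            result.append({**r, 'agg': agg([w['value'] for w in window])})
    return result

rows = [{'id': i, 'group': g, 'value': v}
        for i, g, v in [(1, 'A', 10), (2, 'A', 20), (3, 'A', 30),
                        (4, 'B', 40), (5, 'B', 50)]]
print([r['agg'] for r in windowed(rows, 'group', 'id', (-1, 0), sum)])
print([r['agg'] for r in windowed(rows, 'group', 'id', (-1, 0), mean)])
```

Swapping `sum` for `mean` (or any other reducer) changes the aggregation without touching the framing logic, which is the same separation Spark draws between the window spec and the aggregation function.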
Full Transcript
Windowed aggregations in Apache Spark let you calculate values like sums or averages over a sliding window of rows. You start with a DataFrame and define a window specification that partitions the data into groups and orders the rows within each group. You then apply an aggregation function over this window, whose frame can include the current row plus some number of preceding or following rows. The result is added as a new column to the DataFrame. For example, a rolling sum over the current and previous row within each group shows how values accumulate step by step. This technique is useful for analyzing trends and patterns within grouped data.