
Windowed aggregations in Apache Spark - Step-by-Step Execution

Concept Flow - Windowed aggregations
1. Start with a DataFrame
2. Define the window spec
3. Apply an aggregation over the window
4. Add the result as a new column
5. View the resulting DataFrame
Windowed aggregations apply calculations over a sliding group of rows defined by a window, adding results as new columns.
Execution Sample
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as spark_sum  # alias to avoid shadowing the builtin sum

spark = SparkSession.builder.getOrCreate()
data = [(1, 'A', 10), (2, 'A', 20), (3, 'A', 30), (4, 'B', 40), (5, 'B', 50)]
df = spark.createDataFrame(data, ['id', 'group', 'value'])

# Frame: from the previous row (-1) through the current row (0),
# within each 'group' partition, ordered by 'id'
windowSpec = Window.partitionBy('group').orderBy('id').rowsBetween(-1, 0)
df = df.withColumn('rolling_sum', spark_sum('value').over(windowSpec))
df.show()
This code calculates a rolling sum of 'value' over the current and previous row within each 'group'.
Execution Table
| Step | Row (id, group, value) | Window Rows Included | Aggregation (sum) | New Column 'rolling_sum' |
|------|------------------------|----------------------|--------------------|---------------------------|
| 1    | (1, A, 10)             | rows with id 1       | 10                 | 10                        |
| 2    | (2, A, 20)             | rows with id 1, 2    | 10 + 20 = 30       | 30                        |
| 3    | (3, A, 30)             | rows with id 2, 3    | 20 + 30 = 50       | 50                        |
| 4    | (4, B, 40)             | rows with id 4       | 40                 | 40                        |
| 5    | (5, B, 50)             | rows with id 4, 5    | 40 + 50 = 90       | 90                        |
| 6    | End of DataFrame       | -                    | -                  | -                         |
💡 All rows processed; rolling sums computed for each row within their group windows.
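The frame logic traced in the table above can be sketched in plain Python, with no Spark required. Here `rolling_sum` is a hypothetical helper written only to mirror the semantics of partitionBy('group').orderBy('id').rowsBetween(-1, 0):

```python
from itertools import groupby

data = [(1, 'A', 10), (2, 'A', 20), (3, 'A', 30), (4, 'B', 40), (5, 'B', 50)]

def rolling_sum(rows, start=-1, end=0):
    # Mimics Window.partitionBy('group').orderBy('id').rowsBetween(start, end)
    out = []
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))  # sort by (group, id)
    for _, grp in groupby(ordered, key=lambda r: r[1]):  # partition by group
        grp = list(grp)
        for i, row in enumerate(grp):
            lo = max(0, i + start)   # one row back, clipped at the partition start
            hi = i + end             # up to the current row
            frame = grp[lo:hi + 1]
            out.append(row + (sum(v for _, _, v in frame),))
    return out

for row in rolling_sum(data):
    print(row)  # (id, group, value, rolling_sum), matching the execution table
```

Note how `max(0, i + start)` clips the frame at the partition boundary, which is why the first row of each group sums only itself.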
Variable Tracker
| Variable    | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 | After Step 5 | Final |
|-------------|-------|--------------|--------------|--------------|--------------|--------------|-------|
| rolling_sum | N/A   | 10           | 30           | 50           | 40           | 90           | 90    |
Key Moments - 2 Insights
Why does the rolling sum for id=3 only include values from id=2 and id=3, not id=1?
Because the window is defined with rowsBetween(-1, 0), it includes only the current row and one preceding row, so for id=3 the window covers the rows for id=2 and id=3 (see Execution Table, step 3).
Why are rows partitioned by 'group' before applying the window aggregation?
Partitioning by 'group' means the window aggregation only sums values within the same group, so rows from group 'A' and group 'B' are handled separately (see Execution Table, steps 1-3 vs 4-5).
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table at Step 2. What is the rolling_sum value for row with id=2?
A) 30
B) 10
C) 20
D) 50
💡 Hint
Check the 'New Column rolling_sum' value in the Execution Table, step 2.
At which step does the rolling_sum first include two rows in its window?
A) Step 1
B) Step 4
C) Step 2
D) Step 5
💡 Hint
Look at the 'Window Rows Included' column in the Execution Table to see when two rows are first included.
If the window was changed to rowsBetween(-2, 0), what would be the rolling_sum for id=3?
A) 50
B) 60
C) 30
D) 10
💡 Hint
With rowsBetween(-2, 0), the window includes current and two previous rows; sum values for id=1,2,3.
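The widened frame from the last question can be simulated directly in plain Python. This is a sketch, not Spark code; `frame_sum` is a hypothetical helper for illustration:

```python
# Simulate rowsBetween(-2, 0) for partition 'A' (ids 1..3) -- no Spark needed.
values = {1: 10, 2: 20, 3: 30}        # id -> value within partition 'A'
ids = sorted(values)

def frame_sum(target_id, preceding=2):
    i = ids.index(target_id)
    lo = max(0, i - preceding)        # clip the frame at the partition boundary
    return sum(values[j] for j in ids[lo:i + 1])

print(frame_sum(3))  # 10 + 20 + 30 = 60
```

For id=3 the frame now reaches back two rows, so all three group-'A' values are summed.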
Concept Snapshot
Windowed aggregations in Spark:
- Define a window with partitionBy and orderBy
- Use rowsBetween to set frame (e.g., current and previous rows)
- Apply aggregation function over window (sum, avg, etc.)
- Result added as new column
- Useful for rolling calculations within groups
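The bullets above can be condensed into one small sketch. `windowed` below is a hypothetical helper, not part of the Spark API; it only mirrors the partitionBy / orderBy / rowsBetween / aggregate pipeline in pure Python:

```python
from statistics import mean

def windowed(rows, partition, order, frame, agg):
    """Apply agg over a rowsBetween-style frame within each partition.

    rows:      list of dicts
    partition: key to group by   (like partitionBy)
    order:     key to sort by    (like orderBy)
    frame:     (start, end) row offsets relative to the current row
    agg:       function applied to the list of framed values
    """
    start, end = frame
    groups = {}
    for r in sorted(rows, key=lambda r: (r[partition], r[order])):
        groups.setdefault(r[partition], []).append(r)
    result = []
    for grp in groups.values():
        for i, r in enumerate(grp):
            lo = max(0, i + start)                    # clip at partition start
            window = grp[lo:i + end + 1]
            result.append({**r, 'agg': agg([w['value'] for w in window])})
    return result

rows = [{'id': i, 'group': g, 'value': v}
        for i, g, v in [(1, 'A', 10), (2, 'A', 20), (3, 'A', 30),
                        (4, 'B', 40), (5, 'B', 50)]]
print([r['agg'] for r in windowed(rows, 'group', 'id', (-1, 0), sum)])
print([r['agg'] for r in windowed(rows, 'group', 'id', (-1, 0), mean)])
```

Swapping `sum` for `mean` (or any other reducer) changes the aggregation without touching the framing logic, which is the same separation Spark draws between the window spec and the aggregation function.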
Full Transcript
Windowed aggregations in Apache Spark let you calculate values like sums or averages over a sliding window of rows. You start with a DataFrame and define a window specification that partitions the data into groups and orders the rows within each group. You then apply an aggregation function over this window, whose frame can include the current row plus some number of preceding or following rows. The result is added as a new column to the DataFrame. For example, a rolling sum over the current and previous row within each group shows how values accumulate step by step. This technique is useful for analyzing trends and patterns within grouped data.