Windowed aggregations in Apache Spark - Time & Space Complexity
When using windowed aggregations in Apache Spark, we want to know how the amount of work grows as the dataset grows.
The question is: how much computation does Spark perform when it partitions the data and aggregates over each window?
Analyze the time complexity of the following code snippet:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F  # avoid shadowing Python's built-in sum()

# Frame: the current row and the two preceding rows, per category, ordered by date
windowSpec = Window.partitionBy('category').orderBy('date').rowsBetween(-2, 0)
result = df.withColumn('rolling_sum', F.sum('value').over(windowSpec))
```
This code computes, for each row, the sum of the last 3 rows (the current row plus the two preceding rows) within its category, ordered by date.
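To make the window semantics concrete, here is a minimal plain-Python sketch of the same logic, using hypothetical sample data whose column names mirror the Spark snippet:

```python
# Hypothetical sample rows; 'category', 'date', 'value' mirror the Spark example.
rows = [
    {"category": "A", "date": 1, "value": 10},
    {"category": "A", "date": 2, "value": 20},
    {"category": "A", "date": 3, "value": 30},
    {"category": "A", "date": 4, "value": 40},
    {"category": "B", "date": 1, "value": 5},
    {"category": "B", "date": 2, "value": 15},
]

def rolling_sum(rows, window=3):
    """Per category, ordered by date, sum the current row and up to
    window - 1 preceding rows -- the rowsBetween(-2, 0) frame."""
    out = []
    by_cat = {}
    for r in sorted(rows, key=lambda r: (r["category"], r["date"])):
        history = by_cat.setdefault(r["category"], [])
        history.append(r["value"])
        out.append({**r, "rolling_sum": sum(history[-window:])})
    return out

result = rolling_sum(rows)
print([r["rolling_sum"] for r in result])
# -> [10, 30, 60, 90, 5, 20]
```

Each partition restarts the rolling sum, which is why category B's first row yields 5, not a continuation of category A's totals.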
Identify the operations that repeat: loops, recursion, or traversals over the data.
- Primary operation: For each row, Spark computes the sum over a window of 3 rows.
- How many times: This happens once per row in the dataset.
As the number of rows grows, Spark does a small sum for each row over a fixed window size.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 sums of 3 values each |
| 100 | About 100 sums of 3 values each |
| 1000 | About 1000 sums of 3 values each |
Pattern observation: The total work grows roughly in direct proportion to the number of rows.
Time Complexity: O(n)
This means the time to compute grows linearly with the number of rows in the data.
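The pattern in the table can be checked by counting additions directly. The sketch below (a simplified model, not Spark's actual execution plan) counts how many values are summed across all rows for a fixed window of 3:

```python
def count_ops(n, window=3):
    """Count element additions for a fixed-size rolling window over n rows.
    Each row sums at most `window` values, so total ops <= window * n."""
    ops = 0
    for i in range(n):
        # rowsBetween(-2, 0): the frame holds min(window, i + 1) values
        frame = range(max(0, i - window + 1), i + 1)
        ops += len(frame)
    return ops

for n in (10, 100, 1000):
    print(n, count_ops(n))
# -> 10 27, 100 297, 1000 2997: roughly 3n, i.e. linear in n
```

Scaling n by 10 scales the operation count by almost exactly 10, which is the signature of O(n).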
[X] Wrong: "Windowed aggregations always take quadratic time because they look at multiple rows for each row."
[OK] Correct: The window size is fixed and small, so each row does only a constant amount of work; the total work grows linearly, not quadratically, with the data size.
Understanding how windowed aggregations scale helps you explain data processing efficiency clearly and confidently.
"What if the window size grew with the number of rows? How would the time complexity change?"