Window Functions in Apache Spark: Time Complexity
We want to understand how the time to run window functions changes as data grows.
How does the number of rows affect the work done by window functions?
Analyze the time complexity of the following code snippet.
```python
from pyspark.sql import Window
from pyspark.sql.functions import row_number

# Partition rows by 'category' and order each partition by 'value'.
window_spec = Window.partitionBy('category').orderBy('value')

# Assign a 1-based row number within each ordered partition.
df = df.withColumn('row_num', row_number().over(window_spec))
```
This code adds a row number within each category, ordering rows by value.
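To see what this does per partition, here is a plain-Python sketch of the same logic (no Spark required). The function `row_number_over` and the sample `data` are illustrative names invented for this sketch, not Spark APIs:

```python
from itertools import groupby

def row_number_over(rows, partition_key, order_key):
    """Plain-Python sketch of row_number().over(Window.partitionBy(...).orderBy(...)).

    rows: list of dicts. Returns new dicts with a 'row_num' field,
    numbered 1..m within each partition, ordered by order_key.
    """
    out = []
    # Sort by the partition key first so groupby sees contiguous groups.
    for _, group in groupby(sorted(rows, key=lambda r: r[partition_key]),
                            key=lambda r: r[partition_key]):
        # Order rows inside the partition, then number them starting at 1.
        for i, r in enumerate(sorted(group, key=lambda r: r[order_key]), start=1):
            out.append({**r, 'row_num': i})
    return out

data = [
    {'category': 'a', 'value': 3},
    {'category': 'b', 'value': 1},
    {'category': 'a', 'value': 1},
]
result = row_number_over(data, 'category', 'value')
```

The two nested `sorted` calls are where the interesting cost lives: the per-partition sort is what the rest of this analysis counts.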
Identify the operations that repeat: loops, recursion, or traversals over partitions.
- Primary operation: within each partition (category), Spark must sort the rows by value before assigning row numbers.
- How many times: each row is numbered exactly once, but sorting a partition of m rows costs roughly m log m comparisons.
As the number of rows grows, the work to order rows in each category grows too.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | ~10 rows numbered; sorting such tiny groups adds little extra work |
| 100 | Numbering plus sorting: roughly n log m comparisons across all partitions |
| 1000 | Sorting dominates; total work grows faster than the row count alone |
Pattern observation: The ordering inside each category grows faster than just the number of rows, because sorting takes more work as groups get larger.
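The pattern above can be made concrete with a back-of-the-envelope cost model. Here `estimated_sort_ops` is a hypothetical helper assuming n rows split evenly into a fixed number of categories, each sorted at about m log2(m) comparisons:

```python
import math

def estimated_sort_ops(n, num_categories):
    """Rough comparison count: n rows split evenly into groups of size m,
    each group sorted at ~m * log2(m) comparisons (assumes an even split)."""
    m = n / num_categories
    return num_categories * m * math.log2(m) if m > 1 else n

for n in (10, 100, 1000):
    print(n, round(estimated_sort_ops(n, 5)))
```

With 5 categories, going from 100 to 1000 rows multiplies the estimate by well over 10x, because the log factor grows alongside n.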
Time Complexity: O(n log m), where n is the total number of rows and m is the size of the largest partition.
Each partition of size m_i is sorted in O(m_i log m_i); summed over all partitions this is bounded by O(n log m). In the worst case of a single category (m = n), the cost becomes O(n log n).
[X] Wrong: "Window functions always run in linear time with the number of rows."
[OK] Correct: Sorting inside each partition costs more than a single pass over the rows, so the total cost depends on partition sizes as well as on n.
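One way to see the non-linearity directly is to count comparisons while sorting with Python's built-in sort. This is a local illustration of sort cost, not how Spark executes, and `count_sort_comparisons` is an assumed helper name:

```python
import random
from functools import cmp_to_key

def count_sort_comparisons(groups):
    """Sort each group and count the element comparisons made along the way."""
    count = 0
    def cmp(a, b):
        nonlocal count
        count += 1
        return (a > b) - (a < b)
    for g in groups:
        sorted(g, key=cmp_to_key(cmp))
    return count

random.seed(0)
values = [random.random() for _ in range(1024)]

# Same 1024 rows either way: one big partition vs. 32 partitions of 32.
one_big = count_sort_comparisons([values])
many_small = count_sort_comparisons(
    [values[i:i + 32] for i in range(0, 1024, 32)])

print(one_big, many_small)
```

The single large sort needs noticeably more comparisons than many small ones over the same rows, which is exactly why partition size (m), not just row count (n), drives the cost.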
Understanding how window functions scale helps you explain performance in real data tasks and shows you know how grouping and ordering affect work done.
"What if we removed the orderBy clause from the window specification? How would the time complexity change?"