GroupBy and aggregations in Apache Spark - Time & Space Complexity
When we group data and compute summaries, the running time depends on how much data we have.
We want to know how the work grows as the dataset gets bigger.
Analyze the time complexity of the following code snippet.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()

# Read the CSV, inferring column types from the data
df = spark.read.csv('data.csv', header=True, inferSchema=True)

# Group rows by 'category' and compute the mean of 'value' per group
result = df.groupBy('category').agg(avg('value').alias('average_value'))
result.show()
```
This code groups rows by the 'category' column and calculates the average of 'value' for each group.
Identify the operations that repeat: loops, recursion, and traversals over the data.
- Primary operation: scanning every row once to assign it to a group, then computing one average per group.
- How many times: each row is processed once during grouping; each group is processed once during aggregation.
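The single pass described above can be sketched in plain Python. This is a simplified hash-aggregation model to show why each row is touched once, not how Spark is actually implemented; the `group_average` helper and the sample data are invented for illustration:

```python
from collections import defaultdict

def group_average(rows):
    """Single-pass hash aggregation: keep one (sum, count) pair per group.

    `rows` is a list of (category, value) tuples standing in for the
    DataFrame. Each row is visited exactly once, so the scan is O(n).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, value in rows:   # one hash lookup + update per row
        sums[category] += value
        counts[category] += 1
    # one more pass over the groups (usually far fewer groups than rows)
    return {c: sums[c] / counts[c] for c in sums}

data = [("a", 1.0), ("b", 2.0), ("a", 3.0)]
print(group_average(data))  # {'a': 2.0, 'b': 2.0}
```

Spark distributes this same idea across partitions (partial sums per partition, combined after a shuffle), but the per-row cost stays constant, which is what makes the total linear.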
As the number of rows grows, the time to group and aggregate grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | ~10 operations to assign groups + one per group for averages |
| 100 | ~100 operations to assign groups + one per group for averages |
| 1000 | ~1000 operations to assign groups + one per group for averages |
Pattern observation: The work grows roughly linearly with the number of rows.
Time Complexity: O(n)
This means the time to group and aggregate grows in a straight line as the data size grows.
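A quick way to check the pattern in the table is to count row-level operations directly. This is a pure-Python sketch, not Spark code; the `count_grouping_ops` helper and the random test data are invented for illustration:

```python
from collections import defaultdict
import random

def count_grouping_ops(n):
    """Count the row-level operations a hash-based groupBy performs on n rows."""
    rows = [(random.choice("abc"), i) for i in range(n)]
    ops = 0
    sums, counts = defaultdict(int), defaultdict(int)
    for category, value in rows:
        sums[category] += value
        counts[category] += 1
        ops += 1                    # exactly one grouping operation per row
    return ops

for n in (10, 100, 1000):
    print(n, count_grouping_ops(n))  # the operation count grows in lockstep with n
```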
[X] Wrong: "Grouping always takes constant time no matter how big the data is."
[OK] Correct: Grouping must look at each row to decide its group, so more data means more work.
Understanding how grouping and aggregation scale helps you explain data processing steps clearly and confidently.
"What if we added a second grouping column? How would the time complexity change?"