Split-apply-combine mental model in Pandas - Time & Space Complexity
When you use pandas to group data and then apply calculations, it is important to understand how the running time grows with the size of the data.
The question is how the time changes as we split the data into groups, do work on each group, and then combine the results.
Analyze the time complexity of the following code snippet.
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'C', 'A'],
    'Value': [10, 20, 30, 40, 50, 60]
})

grouped = df.groupby('Category')
result = grouped['Value'].sum()
This code splits the data by 'Category', sums the 'Value' in each group, and combines the sums into a result.
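Running the snippet end to end (a quick check, using the same toy data as above) shows the combined result:

```python
import pandas as pd

# Same toy data as the snippet above
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'C', 'A'],
    'Value': [10, 20, 30, 40, 50, 60]
})

result = df.groupby('Category')['Value'].sum()
print(result)
# Category
# A    100
# B     60
# C     50
# Name: Value, dtype: int64
```

Each input row contributes to exactly one group's sum, which is why the total work tracks the number of rows.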
Identify the loops, recursion, and array traversals that repeat:
- Primary operation: Going through each row to assign it to a group.
- How many times: Once for each row in the data (n times).
- Secondary operation: Summing values inside each group, which depends on group size.
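A rough plain-Python sketch of what grouping-and-summing does conceptually (assuming simple hash-based grouping; pandas' actual implementation is optimized C code, but the work per row is analogous) makes the single pass over all n rows explicit:

```python
# Hypothetical plain-Python equivalent of groupby('Category')['Value'].sum()
def group_sum(categories, values):
    sums = {}
    # One pass over all n rows: hash each key, then add the row's value.
    # Group assignment and summing both happen in this single traversal.
    for cat, val in zip(categories, values):
        sums[cat] = sums.get(cat, 0) + val
    return sums

print(group_sum(['A', 'B', 'A', 'B', 'C', 'A'],
                [10, 20, 30, 40, 50, 60]))
# {'A': 100, 'B': 60, 'C': 50}
```

Because each row is visited once for assignment and once for the addition, the per-group sums across all groups still total only n additions.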
As the number of rows grows, the time to split and sum grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | ~10 group assignments + ~10 additions across all groups |
| 100 | ~100 group assignments + ~100 additions across all groups |
| 1000 | ~1,000 group assignments + ~1,000 additions across all groups |
Pattern observation: the total work grows linearly with the number of rows.
Time Complexity: O(n)
This means the time needed grows roughly in direct proportion to the number of rows in the data.
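One way to see the linear pattern for yourself is to time the operation at a few sizes. This is a rough sketch, not a rigorous benchmark; exact timings vary by machine, but each tenfold increase in rows should cost roughly tenfold more time:

```python
import time
import numpy as np
import pandas as pd

for n in [10_000, 100_000, 1_000_000]:
    # Random data with a handful of categories
    df = pd.DataFrame({
        'Category': np.random.choice(list('ABC'), size=n),
        'Value': np.random.rand(n)
    })
    start = time.perf_counter()
    df.groupby('Category')['Value'].sum()
    elapsed = time.perf_counter() - start
    print(f"n={n:>9,}: {elapsed:.4f}s")
```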
[X] Wrong: "Grouping data is instant and does not depend on data size."
[OK] Correct: Grouping requires looking at every row to decide its group, so it takes more time as data grows.
Understanding how grouping and applying functions scale helps you explain your code choices clearly and shows you know how data size affects performance.
"What if we applied a more complex function instead of sum, like sorting each group? How would the time complexity change?"