Aggregation-based features in Data Analysis Python - Time & Space Complexity
When creating aggregation-based features, we combine data by groups to get summaries like sums or averages.
We want to know how the time to do this grows as the data gets bigger.
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 1, 3, 2, 1],
    'purchase_amount': [100, 200, 150, 300, 250, 50]
})

agg_features = df.groupby('user_id')['purchase_amount'].sum().reset_index()
```
This code groups the rows by `user_id` and sums `purchase_amount` for each user, producing one aggregated row per user (here: 300 for user 1, 450 for user 2, and 300 for user 3).
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Grouping rows by user_id and summing purchase_amount values.
- How many times: Each row is visited once to assign it to a group, then each group is processed to sum values.
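The single pass described above can be sketched with a plain dictionary. This is a simplified illustration of hash-based grouping, not pandas' actual internals:

```python
# Manual sketch of hash-based grouping: one pass over the rows.
# (Simplified illustration; pandas' internals are more optimized.)
rows = [(1, 100), (2, 200), (1, 150), (3, 300), (2, 250), (1, 50)]

totals = {}  # user_id -> running sum
for user_id, amount in rows:  # each row is visited exactly once: O(n)
    totals[user_id] = totals.get(user_id, 0) + amount

print(totals)  # {1: 300, 2: 450, 3: 300}
```

Because each row triggers one dictionary lookup and one addition, the total work is proportional to the number of rows.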
As the number of rows grows, the time to group and sum grows roughly in a straight line: double the rows, and you roughly double the work.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 visits to rows plus grouping steps |
| 100 | About 100 visits to rows plus grouping steps |
| 1000 | About 1000 visits to rows plus grouping steps |
Pattern observation: The work grows roughly in direct proportion to the number of rows.
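The pattern in the table can be reproduced by counting row visits directly. The `grouped_sum_with_counter` helper below is a hypothetical sketch, assuming the same dictionary-based single pass described earlier:

```python
# Hypothetical helper that counts how many row visits a grouped sum needs.
def grouped_sum_with_counter(rows):
    visits = 0
    totals = {}
    for user_id, amount in rows:
        visits += 1  # one visit per row
        totals[user_id] = totals.get(user_id, 0) + amount
    return totals, visits

for n in (10, 100, 1000):
    rows = [(i % 5, 1) for i in range(n)]  # 5 users, n rows
    _, visits = grouped_sum_with_counter(rows)
    print(n, visits)  # visits equals n, matching the table's linear pattern
```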
Time Complexity: O(n)
This means the time to create aggregation features grows linearly with the number of data rows. (pandas groups via hashing, so assigning rows to groups is O(n) on average.)
[X] Wrong: "Grouping and summing takes the same time no matter how many rows there are."
[OK] Correct: More rows mean more data to process, so the time grows as the data grows.
Understanding how aggregation scales helps you explain data processing steps clearly and shows you can think about efficiency.
"What if we added a nested loop to compute pairwise differences within each group? How would the time complexity change?"
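One way to explore that follow-up question is to sketch it in code. The `pairwise_diffs` helper below is illustrative, not a built-in pandas function: for a group of size k it does k·(k-1)/2 comparisons, so the total work becomes O(Σk²), up to O(n²) when one group holds most of the rows:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 1, 3, 2, 1],
    'purchase_amount': [100, 200, 150, 300, 250, 50]
})

# Hypothetical helper: absolute pairwise differences within one group.
# A group of size k yields k*(k-1)/2 pairs -- quadratic in the group size.
def pairwise_diffs(group):
    values = group.tolist()
    return [abs(a - b) for a, b in combinations(values, 2)]

diffs = df.groupby('user_id')['purchase_amount'].apply(pairwise_diffs)
print(diffs.to_dict())  # {1: [50, 50, 100], 2: [50], 3: []}
```

The single-element group for user 3 produces no pairs, while user 1's three purchases produce three, showing how the per-group cost grows quadratically with group size.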