diff() for differences in Pandas - Time & Space Complexity
We want to understand how the time to compute row-over-row differences grows as the data gets bigger.
How does the amount of work change as a DataFrame gains more rows?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

n = 10  # example size; scale this up to see how the work grows
df = pd.DataFrame({
    'values': range(1, n + 1)
})
diff_series = df['values'].diff()
```
This code creates a DataFrame with a column of numbers and calculates the difference between each row and the previous row.
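A quick sketch of what the result looks like; note that the first entry is `NaN`, because row 0 has no previous row to subtract from:

```python
import pandas as pd

n = 10
df = pd.DataFrame({'values': range(1, n + 1)})
diff_series = df['values'].diff()

# First entry is NaN (no predecessor); every other entry is 1.0,
# since consecutive values in range(1, n + 1) differ by exactly 1.
print(diff_series.head(3))
```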
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: pandas goes through each row once to subtract the previous row's value.
- How many times: It does this for every row except the first one, so roughly n-1 times for n rows.
As the number of rows grows, the number of difference calculations grows at about the same rate.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 9 |
| 100 | 99 |
| 1000 | 999 |
Pattern observation: The operations grow roughly in a straight line with the number of rows.
Time Complexity: O(n)
This means the time to compute differences grows directly in proportion to the number of rows.
[X] Wrong: "Calculating differences takes the same time no matter how many rows there are."
[OK] Correct: Each row needs to be checked and subtracted from the previous one, so more rows mean more work.
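To make the per-row work explicit, here is a plain-Python equivalent (the helper name `manual_diff` is just for illustration). It performs exactly one subtraction per row after the first, which is where the O(n) comes from; pandas does the same linear amount of work, just in fast vectorized C:

```python
def manual_diff(values):
    """Return row-over-row differences, mirroring Series.diff()."""
    out = [float('nan')]  # first row has no predecessor
    for i in range(1, len(values)):
        out.append(values[i] - values[i - 1])  # one subtraction per row
    return out

print(manual_diff([1, 2, 4, 7]))  # [nan, 1, 2, 3]
```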
Knowing how operations like diff() scale helps you explain your code's efficiency clearly and confidently in real projects.
"What if we used diff() on multiple columns at once? How would the time complexity change?"
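As a hedged sketch of an answer: calling `diff()` on a whole DataFrame repeats the same row-wise subtraction once per column, so with n rows and m columns the work grows as O(n·m). For a fixed, small number of columns, that is still linear in the number of rows:

```python
import pandas as pd

n = 5
df = pd.DataFrame({
    'a': range(1, n + 1),     # consecutive values differ by 1
    'b': range(0, 2 * n, 2),  # consecutive values differ by 2
})

# DataFrame.diff() runs the per-column O(n) pass once for each column.
diffs = df.diff()
print(diffs)
```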