shift() for lagging data in Pandas - Time & Space Complexity
We want to understand how the time needed to shift data grows as the data size grows.
How does using shift() on a pandas column scale with more rows?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({'value': range(1_000_000)})
df['lagged'] = df['value'].shift(1)
```
This code creates a DataFrame with one million rows and adds a new column that shifts the original values down by one row; the first row of the new column has no predecessor, so it is filled with NaN.
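A quick look at a tiny version of the same operation makes the behavior concrete (a minimal sketch; only the frame size differs from the snippet above):

```python
import pandas as pd

df = pd.DataFrame({'value': range(5)})  # small frame so the output is easy to read
df['lagged'] = df['value'].shift(1)     # row 0 has no predecessor, so it becomes NaN
print(df)
#    value  lagged
# 0      0     NaN
# 1      1     0.0
# 2      2     1.0
# 3      3     2.0
# 4      4     3.0
```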
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: pandas copies the column's underlying array one position over. This is a vectorized copy rather than a Python-level loop, but every value is still read and written once (see the sketch after this list).
- How many times: once for each row in the DataFrame, so n element moves for n rows.
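A rough model of the work shift(1) performs (a sketch of the equivalent computation, not pandas' actual implementation):

```python
import numpy as np

def shift_by_one(values: np.ndarray) -> np.ndarray:
    """Model of Series.shift(1): same amount of work, not pandas' real code."""
    out = np.empty(len(values), dtype=float)
    out[0] = np.nan        # the first slot has nothing to lag from
    out[1:] = values[:-1]  # copy n - 1 elements; each is touched exactly once
    return out

print(shift_by_one(np.array([10, 20, 30, 40])))  # [nan 10. 20. 30.]
```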
As the number of rows increases, the time to shift grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 moves |
| 100 | About 100 moves |
| 1000 | About 1000 moves |
Pattern observation: Doubling the rows roughly doubles the work done.
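You can check this pattern empirically. The sketch below times shift(1) at increasing sizes; exact numbers vary by machine, but doubling n should roughly double the elapsed time:

```python
import time
import pandas as pd

for n in (1_000_000, 2_000_000, 4_000_000):
    s = pd.Series(range(n))
    start = time.perf_counter()
    s.shift(1)
    elapsed = time.perf_counter() - start
    print(f"n = {n:>9,}: {elapsed * 1000:.2f} ms")
```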
Time Complexity: O(n)
Space Complexity: O(n)
The time to shift grows linearly with the number of rows, and so does the memory: shift() returns a brand-new column of the same length instead of modifying values in place.
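The space cost is easy to observe directly (a sketch; exact byte counts depend on the dtypes involved):

```python
import pandas as pd

df = pd.DataFrame({'value': range(1_000_000)})
before = df.memory_usage(deep=True).sum()
df['lagged'] = df['value'].shift(1)  # allocates a new float64 array of length n
after = df.memory_usage(deep=True).sum()
print(f"added {(after - before) / 1e6:.1f} MB")  # ~8 MB for 1,000,000 float64 values
```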
[X] Wrong: "Using shift() is instant and does not depend on data size."
[OK] Correct: Even though shift() is fast, it still processes each row once, so larger inputs take proportionally more time.
Understanding how simple operations like shifting scale helps you write efficient data code and explain your choices clearly.
What if we changed shift(1) to shift(1000)? How would the time complexity change?
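One way to build intuition for that question is to measure it. Whatever the shift amount k, pandas still copies n - k elements and writes k NaNs, so the work stays proportional to n. A quick empirical check (a sketch; timings are machine-dependent):

```python
import time
import pandas as pd

s = pd.Series(range(10_000_000))
for k in (1, 1000):
    start = time.perf_counter()
    s.shift(k)
    elapsed = time.perf_counter() - start
    print(f"shift({k}): {elapsed * 1000:.2f} ms")
# Both runs should take about the same time: k changes only how many NaNs
# are written, not how many of the n elements must be copied.
```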