Forward fill and backward fill in Pandas - Time & Space Complexity
We want to understand how the time it takes to fill missing data grows as the data size increases.
Specifically, how does forward fill or backward fill behave when applied to larger data?
Analyze the time complexity of the following code snippet.
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
n = 1000
data = pd.DataFrame({
'A': np.random.choice([1, np.nan], size=n)
})
# Forward fill missing values
filled = data.ffill()
# Backward fill missing values
filled_back = data.bfill()
This code creates a column with some missing values and fills them forward and backward.
Identify the loops, recursion, array traversals that repeat.
- Primary operation: Scanning the column values once to fill missing data.
- How many times: Each element is visited once in order for forward fill, and once for backward fill.
As the number of rows increases, the time to fill missing values grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 steps to fill |
| 100 | About 100 steps to fill |
| 1000 | About 1000 steps to fill |
Pattern observation: The work grows linearly as the data size grows.
Time Complexity: O(n)
This means the time to fill missing values grows directly with the number of rows.
[X] Wrong: "Forward fill checks every previous value multiple times, so it takes much longer than linear time."
[OK] Correct: The fill methods scan the data only once in order, so each element is visited once, making it linear time.
Understanding how filling missing data scales helps you handle larger datasets efficiently and shows you know how pandas works under the hood.
What if we applied forward fill on multiple columns at once? How would the time complexity change?