Interpolation for missing values in Pandas - Time & Space Complexity
We want to understand how the time needed to fill missing data by interpolation changes as the data size grows.
How does the work increase when we have more data points to fill?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd
import numpy as np

# Create a DataFrame with missing values
n = 1000
values = np.random.rand(n)
values[::10] = np.nan  # insert a NaN every 10 elements (including index 0)
df = pd.DataFrame({'A': values})

# Interpolate missing values
# Note: the NaN at index 0 has no left neighbor, so linear
# interpolation leaves it as NaN by default.
result = df['A'].interpolate(method='linear')
```
This code creates a column with missing values and fills them by linear interpolation.
Identify the loops, recursion, and array traversals that repeat.
- Primary operation: Scanning the column to find missing values and calculating interpolated values between known points.
- How many times: The scan runs once over the entire column, processing each element to find gaps and fill them from their neighbors.
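To see this neighbor-based fill concretely, here is a minimal sketch on a three-element series (assuming only that pandas and NumPy are installed):

```python
import pandas as pd
import numpy as np

# A tiny series with one gap: filling it only requires the two neighbors
s = pd.Series([1.0, np.nan, 3.0])
filled = s.interpolate(method='linear')
print(filled.tolist())  # the NaN becomes the midpoint of its neighbors: [1.0, 2.0, 3.0]
```

Each interior gap is filled the same way during a single left-to-right pass, which is why the total work tracks the column length.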
As the number of data points grows, the time to interpolate grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 steps to check and fill missing values |
| 100 | About 100 steps |
| 1000 | About 1000 steps |
Pattern observation: Doubling the data roughly doubles the work needed.
Time Complexity: O(n)
This means the time to interpolate grows linearly with the number of data points.
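One rough way to check the linear trend empirically is to time the call at increasing sizes. This is a sketch, not a benchmark: absolute times depend on your machine, and small inputs are dominated by fixed overhead, so the proportionality shows up best at larger n.

```python
import time
import numpy as np
import pandas as pd

for n in (10_000, 100_000, 1_000_000):
    values = np.random.rand(n)
    values[::10] = np.nan
    s = pd.Series(values)

    start = time.perf_counter()
    result = s.interpolate(method='linear')
    elapsed = time.perf_counter() - start
    print(f"n={n:>9}: {elapsed:.4f} s")
```

If the O(n) analysis holds, each tenfold increase in n should increase the elapsed time by roughly a factor of ten.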
[X] Wrong: "Interpolation time grows faster than the data size because it does complex calculations for each missing value."
[OK] Correct: Linear interpolation only looks at neighbors and fills gaps in one pass, so the work grows steadily, not faster.
Understanding how interpolation scales helps you explain data cleaning steps clearly and shows you can think about efficiency in real data tasks.
"What if we changed the interpolation method to polynomial of degree 3? How would the time complexity change?"