Filling missing values with fillna() in Pandas - Time & Space Complexity
We want to understand how the time needed to fill missing values changes as the data grows.
How does the work grow when we use fillna() on bigger data?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, None, 3, None, 5],
    'B': [None, 2, None, 4, 5]
})

filled_df = df.fillna(0)
```
This code replaces all missing values in the DataFrame with zero.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Checking each cell in the DataFrame to see if it is missing and replacing it if so.
- How many times: Once for every cell in the DataFrame (rows x columns).
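To make "one check per cell" concrete, a small sketch can count the cells and the missing values in the example DataFrame using `df.size` and `df.isna()`:

```python
import pandas as pd

# The same DataFrame as above: 5 rows x 2 columns = 10 cells.
df = pd.DataFrame({
    'A': [1, None, 3, None, 5],
    'B': [None, 2, None, 4, 5]
})

total_cells = df.size                       # rows * columns = 10
missing_cells = int(df.isna().sum().sum())  # 4 missing values

# fillna(0) must inspect all 10 cells to find and replace the 4 missing ones.
filled_df = df.fillna(0)
```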
As the number of rows or columns grows, the number of cells to check grows too.
| Input Size (rows x columns) | Approx. Operations |
|---|---|
| 10 x 2 = 20 | About 20 checks and replacements |
| 100 x 2 = 200 | About 200 checks and replacements |
| 1000 x 2 = 2000 | About 2000 checks and replacements |
Pattern observation: The work grows directly with the number of cells; doubling the data doubles the work.
Time Complexity: O(n)
This means the time to fill missing values grows linearly with the total number of cells in the DataFrame.
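Wall-clock timing is noisy on small inputs, so this sketch counts cells instead of measuring time; the sizes are chosen to match the table above, and the cell count (and hence the work) grows in lockstep with the rows:

```python
import pandas as pd
import numpy as np

cells_checked = []
for rows in (10, 100, 1000):
    # Two columns where roughly half the values are missing.
    idx = np.arange(rows)
    df = pd.DataFrame({
        'A': np.where(idx % 2 == 0, 1.0, np.nan),
        'B': np.where(idx % 2 == 1, 2.0, np.nan),
    })
    filled = df.fillna(0)
    assert filled.isna().sum().sum() == 0  # every missing cell was replaced
    cells_checked.append(df.size)          # rows * 2 cells considered

print(cells_checked)  # grows linearly: [20, 200, 2000]
```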
[X] Wrong: "fillna() only checks missing values, so it runs faster than looking at every cell."
[OK] Correct: The method must look at every cell to know if it is missing or not, so it still touches all data points.
Understanding how data size affects operations like filling missing values helps you write efficient data cleaning steps in real projects.
"What if we used fillna() only on one column instead of the whole DataFrame? How would the time complexity change?"
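One way to explore that question: passing a dict to `fillna()` fills only the named columns, so only those columns' cells are touched. The work is then proportional to the number of rows in one column rather than rows × columns; still linear, but with a smaller constant. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, None, 3, None, 5],
    'B': [None, 2, None, 4, 5]
})

# Fill only column 'A'; column 'B' is left untouched.
partly_filled = df.fillna({'A': 0})

# 'A' now has no missing values, but 'B' still does:
# the work was proportional to one column's rows, not all cells.
```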