Missing Data Strategies in pandas - Time & Space Complexity
When working with missing data in pandas, it is important to understand how the cost of each handling strategy grows with the size of the dataset.
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4, None],
    'B': [None, 2, 3, None, 5]
})

# Strategy 1: Drop rows with any missing values
df_clean = df.dropna()

# Strategy 2: Fill missing values with a constant
df_filled = df.fillna(0)
```
This code shows two common ways to handle missing data: dropping rows with missing values and filling missing values with a constant.
Identify the repeated work: loops, recursion, or array traversals.
- Primary operation: Scanning each row and column to check for missing values.
- How many times: Once for each cell in the DataFrame (rows x columns).
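This per-cell scan can be made visible with `isna()`, which materializes exactly one boolean check per cell. A small sketch using the same DataFrame as above:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4, None],
    'B': [None, 2, 3, None, 5]
})

# isna() produces one boolean per cell -- the same checks
# that dropna() and fillna() must perform internally
mask = df.isna()
print(mask.size)         # 5 rows x 2 columns = 10 checks
print(mask.sum().sum())  # 4 missing cells in total
```

The size of the mask, rows x columns, is exactly the number of checks performed, regardless of which strategy is applied afterwards.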
As the number of rows grows, the operations to find and handle missing values increase proportionally.
| Input Size (rows) | Approx. Operations |
|---|---|
| 10 | About 10 x columns checks |
| 100 | About 100 x columns checks |
| 1000 | About 1000 x columns checks |
Pattern observation: The work grows linearly with the number of rows.
Time Complexity: O(n)
This means the time to handle missing data grows roughly in direct proportion to the number of rows in the data.
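A quick sketch can confirm the linear growth by counting the per-cell checks at each input size from the table (the random data and 30% missing rate here are illustrative assumptions, not from the original snippet):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen for reproducibility

for n in (10, 100, 1000):
    data = rng.random((n, 2))
    data[rng.random((n, 2)) < 0.3] = np.nan  # ~30% missing, arbitrary
    df = pd.DataFrame(data, columns=['A', 'B'])

    # Both dropna() and fillna() must inspect every cell once
    checks = df.isna().size
    print(n, checks)  # checks == 2 * n, growing linearly with rows
```

Doubling the rows doubles the checks, which is exactly the O(n) pattern in the table above.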
[X] Wrong: "Handling missing data takes the same time no matter how big the dataset is."
[OK] Correct: The code must check each cell for missing values, so more data means more checks and more time.
Understanding how missing data handling scales helps you explain your data cleaning choices clearly and confidently in real projects.
"What if we used a method that fills missing values based on the mean of each column? How would the time complexity change?"