Dropping missing values with dropna() in Pandas - Time & Space Complexity
We want to understand how the time to remove missing values grows as the data gets bigger.
How does the work change when we have more rows in our data?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4],
    'C': [1, None, 3, 4]
})

# Drop every row that contains at least one missing value
clean_data = data.dropna()
```
This code creates a small table with some missing values and removes all rows that have any missing value.
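To make that concrete, here is a quick check (using the same small table) showing that only the fully populated row survives:

```python
import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4],
    'C': [1, None, 3, 4]
})

clean_data = data.dropna()

# Only the last row (index 3) has no missing values, so one row remains
print(clean_data.shape)  # → (1, 3)
```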
Identify the repeated work: the loops, recursion, or array traversals.
- Primary operation: Checking each cell in the table to find missing values.
- How many times: Once for every cell in the data (rows x columns).
As the number of rows grows, the work grows roughly in direct proportion to the number of cells.
| Input Size (rows) | Approx. Operations (cells checked) |
|---|---|
| 10 | 10 x columns |
| 100 | 100 x columns |
| 1000 | 1000 x columns |
Pattern observation: Doubling rows doubles the work, since every cell is checked once.
Time Complexity: O(n * m)
This means the time grows proportionally with the number of rows (n) times the number of columns (m).
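You can observe this scaling with a rough timing sketch. This is an illustrative experiment, not a rigorous benchmark: absolute timings depend on your machine and pandas version, but doubling the rows should roughly double the time.

```python
import time
import numpy as np
import pandas as pd

def time_dropna(n_rows, n_cols, seed=0):
    """Build an n_rows x n_cols frame with ~10% missing values and time dropna()."""
    rng = np.random.default_rng(seed)
    values = rng.random((n_rows, n_cols))
    mask = rng.random((n_rows, n_cols)) < 0.1
    df = pd.DataFrame(np.where(mask, np.nan, values))
    start = time.perf_counter()
    df.dropna()
    return time.perf_counter() - start

# Each doubling of rows should roughly double the elapsed time
for n in (10_000, 20_000, 40_000):
    print(n, "rows:", round(time_dropna(n, 10), 5), "seconds")
```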
[X] Wrong: "dropna() only looks at rows, so time grows with rows only."
[OK] Correct: dropna() must check every cell to find missing values, so the number of columns also affects the running time.
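One way to see that every cell is inspected is to rebuild the row filter by hand. The sketch below uses `notna()` (one boolean check per cell) followed by `all(axis=1)` (one pass per row), and produces the same result as `dropna()`:

```python
import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4],
    'C': [1, None, 3, 4]
})

# notna() builds an n x m boolean table: one check per cell.
# all(axis=1) then keeps only the rows where every cell is present.
keep = data.notna().all(axis=1)
manual_clean = data[keep]

assert manual_clean.equals(data.dropna())
```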
Understanding how data cleaning steps like dropna() scale helps you explain your code choices clearly and confidently.
"What if we used dropna(axis=1) to drop columns with missing values? How would the time complexity change?"