Combining multiple cleaning steps in Pandas - Time & Space Complexity
We want to understand how the time needed changes when we combine several cleaning steps in pandas.
How does the total work grow as the data gets bigger?
Analyze the time complexity of the following code snippet.
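```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', None, 'baz', 'foo'],
    'B': [1, 2, 3, None, 5],
    'C': ['x', 'y', 'z', 'x', None]
})

df = df.dropna()                # step 1: drop rows with any missing value
df['A'] = df['A'].str.upper()   # step 2: uppercase column A
df['B'] = df['B'] * 10          # step 3: multiply column B by 10
```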
This code drops rows with missing values, changes column A to uppercase, and multiplies column B by 10.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Each cleaning step makes one pass over the data: dropna scans every column of every row, while the uppercase and multiply steps each scan a single column.
- How many times: Three passes in total, run one after another: one for dropna, one for uppercasing, one for multiplying.
Each step visits every remaining row once, so the total work grows roughly linearly with the number of rows.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 30 (3 steps x 10 rows) |
| 100 | About 300 (3 steps x 100 rows) |
| 1000 | About 3000 (3 steps x 1000 rows) |
Pattern observation: The total work grows linearly as the data size grows.
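One way to check this pattern empirically is to time the same three steps on frames of increasing size. The sketch below is only an illustration: `make_frame`, `clean`, and `time_clean` are hypothetical helpers, not part of pandas, and the synthetic data (about 10% missing values in column B) is an assumption chosen to resemble the example.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n: int, seed: int = 0) -> pd.DataFrame:
    """Build a hypothetical n-row frame shaped like the example."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "A": rng.choice(["foo", "bar", "baz"], size=n),
        "B": rng.integers(1, 100, size=n).astype(float),
        "C": rng.choice(["x", "y", "z"], size=n),
    })
    # Blank out roughly 10% of column B so dropna has rows to remove.
    df.loc[rng.random(n) < 0.1, "B"] = np.nan
    return df


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same three cleaning steps as the snippet above."""
    df = df.dropna().copy()          # explicit copy so the caller's frame is untouched
    df["A"] = df["A"].str.upper()
    df["B"] = df["B"] * 10
    return df


def time_clean(n: int) -> float:
    """Return wall-clock seconds spent cleaning an n-row frame."""
    df = make_frame(n)
    start = time.perf_counter()
    clean(df)
    return time.perf_counter() - start


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        print(f"n={n:>9,}  seconds={time_clean(n):.4f}")
```

If the growth is linear, each tenfold increase in n should take roughly ten times as long. Timings are noisy, and small inputs are dominated by fixed overhead, so the trend is easiest to see at the larger sizes.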
Time Complexity: O(n)
This means the time needed grows in direct proportion to the number of rows, assuming the number of columns and the typical string length stay fixed.
[X] Wrong: "Combining multiple steps multiplies the time complexity, making it much slower than each step alone."
[OK] Correct: Each step runs one after another, so the total time adds up, not multiplies. The overall growth stays linear, just with a bigger constant factor.
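The difference between adding and multiplying can be made concrete by counting row visits. This is a simplified model in plain Python, not how pandas works internally; `sequential_passes` and `nested_passes` are hypothetical names used only for illustration.

```python
def sequential_passes(n: int) -> int:
    """Count row visits when three steps each make their own pass."""
    visits = 0
    for _ in range(n):   # pass 1: stands in for dropna
        visits += 1
    for _ in range(n):   # pass 2: stands in for uppercasing
        visits += 1
    for _ in range(n):   # pass 3: stands in for multiplying
        visits += 1
    return visits


def nested_passes(n: int) -> int:
    """Count row visits when one pass runs inside another (for contrast)."""
    visits = 0
    for _ in range(n):
        for _ in range(n):
            visits += 1
    return visits


for n in (10, 100, 1000):
    print(n, sequential_passes(n), nested_passes(n))
# 10 30 100
# 100 300 10000
# 1000 3000 1000000
```

The sequential count is always 3n, which matches the table above, while the nested count is n squared. Only the nested pattern changes the growth rate; running steps back to back changes the constant factor.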
Understanding how multiple data cleaning steps add up helps you explain your approach clearly and shows you can think about efficiency in real projects.
"What if we combined the cleaning steps into one function that processes all columns at once? How would the time complexity change?"