Why systematic cleaning matters in Pandas - Performance Analysis
We want to see how the time it takes to clean data grows as the dataset gets larger. In particular: how does the number of rows affect how long the cleaning process takes?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

def clean_data(df):
    df = df.dropna()                      # remove rows with any missing value
    df['age'] = df['age'].astype(int)     # convert ages from float to int
    df = df[df['salary'] > 0]             # keep only rows with positive salary
    df['name'] = df['name'].str.strip()   # trim surrounding whitespace from names
    return df

# Example usage:
# cleaned_df = clean_data(raw_df)
```
This code removes missing data, converts ages to integers, filters positive salaries, and trims spaces from names.
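As a sketch, here is the same function applied to a small made-up DataFrame so each step is visible. The sample names, ages, and salaries are hypothetical, and a `.copy()` is added after the salary filter (the original snippet omits it) to avoid a `SettingWithCopyWarning`:

```python
import pandas as pd

# Hypothetical raw data: one missing name, one non-positive salary,
# float ages, and untrimmed whitespace.
raw_df = pd.DataFrame({
    "name": ["  Alice ", "Bob", None, "Dana"],
    "age": [30.0, 25.0, 40.0, 22.0],
    "salary": [50000, -1, 60000, 45000],
})

def clean_data(df):
    df = df.dropna()                             # drops the row with name=None
    df["age"] = df["age"].astype(int)            # 30.0 -> 30, etc.
    df = df[df["salary"] > 0].copy()             # drops Bob (salary -1)
    df["name"] = df["name"].str.strip()          # "  Alice " -> "Alice"
    return df

cleaned = clean_data(raw_df)
print(cleaned)  # two rows remain: Alice and Dana
```

Two of the four original rows survive: one is removed for a missing value and one for a non-positive salary.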
Identify the repeated work: any loops, recursion, or full-column traversals.
- Primary operation: each cleaning step scans every row — checking for missing values, converting types, filtering, and trimming strings.
- How many times: each of the four steps touches all n rows once, so roughly 4n operations in total.
As the number of rows grows, the time to clean grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 40 (4 steps x 10 rows) |
| 100 | About 400 (4 steps x 100 rows) |
| 1000 | About 4000 (4 steps x 1000 rows) |
Pattern observation: Doubling the rows roughly doubles the work needed.
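The table's estimates can be reproduced with a tiny counting model (a sketch, assuming each step scans every row exactly once):

```python
def estimated_operations(n_rows, n_steps=4):
    # Each cleaning step scans every row once, so work ~ steps * rows.
    return n_steps * n_rows

for n in (10, 100, 1000):
    print(n, estimated_operations(n))  # 40, 400, 4000 — matches the table
```

Note that doubling `n_rows` exactly doubles the estimate, which is the linear pattern observed above.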
Time Complexity: O(n)
This means the cleaning time grows linearly with the number of rows in the data.
[X] Wrong: "Cleaning a few columns is always very fast, no matter how many rows there are."
[OK] Correct: Even if you clean only a few columns, every row still needs to be checked, so time grows with the number of rows.
Understanding how cleaning scales helps you explain your approach clearly and shows you think about efficiency in real data tasks.
"What if we added a nested loop to clean data by comparing each row to every other row? How would the time complexity change?"