Why Data Cleaning Consumes Most Analysis Time in Python: A Performance Analysis
Data cleaning is often the most time-consuming step of a data analysis project. To understand why, we examine how its cost grows with the size of the data.
How does the time needed to clean data change when we get more data?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

def clean_data(df):
    df = df.drop_duplicates()   # pass 1: scan all rows for duplicates
    df = df.ffill()             # pass 2: forward-fill missing values (fillna(method='ffill') is deprecated)
    # pass 3: strip whitespace from string entries in one column
    df['column'] = df['column'].apply(lambda x: x.strip() if isinstance(x, str) else x)
    return df
```
This code removes duplicates, fills missing values, and cleans text in one column.
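To see the three steps in action, here is a minimal sketch that runs the function on a tiny, hypothetical DataFrame (the column name `column` and the sample values are made up for illustration):

```python
import pandas as pd

def clean_data(df):
    df = df.drop_duplicates()
    df = df.ffill()  # forward-fill replaces the deprecated fillna(method='ffill')
    df['column'] = df['column'].apply(lambda x: x.strip() if isinstance(x, str) else x)
    return df

# Hypothetical data: a duplicate row, a missing value, and stray whitespace.
raw = pd.DataFrame({'column': ['  alpha ', '  alpha ', None, 'beta']})
cleaned = clean_data(raw)
print(cleaned['column'].tolist())  # prints ['alpha', 'alpha', 'beta']
```

The duplicate row is dropped, the missing value is forward-filled from the previous row, and the surrounding whitespace is stripped.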
Identify the repeated work: the loops, recursion, and array traversals that run for every row.
- Primary operation: Scanning all rows multiple times for duplicates, missing values, and text cleaning.
- How many times: Each operation goes through the entire dataset once or more.
As the number of rows grows, each cleaning step takes longer because it checks every row.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 30 operations (3 passes over 10 rows) |
| 100 | About 300 operations (3 passes over 100 rows) |
| 1000 | About 3000 operations (3 passes over 1000 rows) |
Pattern observation: The time grows roughly in direct proportion to the number of rows.
Time Complexity: O(n)
This means the cleaning time grows linearly with the amount of data.
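One informal way to check the linear pattern is to time the function on synthetic data of increasing size. This is a rough sketch, not a benchmark: absolute timings depend on the machine, and the data is invented for illustration.

```python
import time
import pandas as pd

def clean_data(df):
    df = df.drop_duplicates()
    df = df.ffill()
    df['column'] = df['column'].apply(lambda x: x.strip() if isinstance(x, str) else x)
    return df

def time_cleaning(n):
    """Build n rows of synthetic text data and time one cleaning run."""
    df = pd.DataFrame({'column': [f' row{i} ' for i in range(n)]})
    start = time.perf_counter()
    clean_data(df)
    return time.perf_counter() - start

# Doubling the rows should roughly double the time if the cost is O(n).
for n in (50_000, 100_000):
    print(f"{n} rows: {time_cleaning(n):.4f} s")
```

If the growth were worse than linear (say, quadratic), doubling the rows would roughly quadruple the time instead.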
[X] Wrong: "Data cleaning time stays the same no matter how much data there is."
[OK] Correct: Cleaning checks every row, so more data means more work and more time.
Understanding how data cleaning scales helps you explain why it dominates analysis time and shows that you can reason clearly about real data challenges.
"What if we added nested loops to compare every row with every other row during cleaning? How would the time complexity change?"
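As a sketch of that scenario, consider a hypothetical near-duplicate check (function name and matching rule invented here) that compares every row with every other row. The nested loops perform n(n-1)/2 comparisons, so the step becomes O(n²): going from 100 to 1,000 rows multiplies the work by roughly 100, not 10.

```python
import pandas as pd

def find_near_duplicates(df, col='column'):
    """Compare every pair of rows: n(n-1)/2 comparisons, i.e. O(n^2)."""
    values = df[col].tolist()
    pairs = []
    for i in range(len(values)):             # outer loop: n iterations
        for j in range(i + 1, len(values)):  # inner loop: up to n - 1 iterations
            # Toy matching rule: case-insensitive equality.
            if str(values[i]).lower() == str(values[j]).lower():
                pairs.append((i, j))
    return pairs

df = pd.DataFrame({'column': ['Alpha', 'alpha', 'beta']})
print(find_near_duplicates(df))  # prints [(0, 1)]
```

This is why pairwise comparisons are usually avoided on large datasets in favor of hashing or sorting-based approaches, which keep deduplication closer to O(n) or O(n log n).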