Handling inconsistent values in Pandas - Time & Space Complexity
When cleaning data, fixing inconsistent values is common. We want to know how the time needed changes as data grows.
How does the work increase when we handle more rows with inconsistent values?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

data = pd.DataFrame({
    'color': ['Red', 'red', 'RED', 'Blue', 'blue', 'BLUE'] * 1000
})
data['color_clean'] = data['color'].str.lower()
```
This code fixes inconsistent capitalization by converting all color names to lowercase.
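To see the effect of the cleaning step, here is a minimal sketch on the small sample from the snippet above, counting distinct values before and after the conversion:

```python
import pandas as pd

# Small sample mirroring the snippet above
data = pd.DataFrame({
    'color': ['Red', 'red', 'RED', 'Blue', 'blue', 'BLUE']
})
data['color_clean'] = data['color'].str.lower()

print(data['color'].nunique())        # 6 inconsistent spellings
print(data['color_clean'].nunique())  # 2 consistent values after cleaning
print(sorted(data['color_clean'].unique()))  # ['blue', 'red']
```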
Identify any loops, recursion, or array traversals that repeat work.
- Primary operation: applying the `str.lower()` method to each string in the column.
- How many times: once for every row in the DataFrame.
Each new row adds one more string to convert to lowercase, so the work grows steadily with the number of rows.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 lowercase conversions |
| 100 | 100 lowercase conversions |
| 1000 | 1000 lowercase conversions |
Pattern observation: The work grows directly in proportion to the number of rows.
Time Complexity: O(n)
This means the time to fix inconsistent values grows linearly as the data size grows.
[X] Wrong: "Fixing inconsistent values takes the same time no matter how many rows there are."
[OK] Correct: Each row needs to be checked and fixed, so more rows mean more work and more time.
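You can observe this linear growth directly. The sketch below is a rough timing experiment (assumed helper `time_lowercase`, not part of the original snippet); absolute numbers will vary by machine, but the elapsed time should grow roughly in proportion to the row count:

```python
import time
import pandas as pd

def time_lowercase(n):
    """Build a DataFrame with 3*n rows and time the lowercase conversion."""
    df = pd.DataFrame({'color': ['Red', 'BLUE', 'green'] * n})
    start = time.perf_counter()
    cleaned = df['color'].str.lower()
    elapsed = time.perf_counter() - start
    return len(cleaned), elapsed

for n in (1_000, 10_000, 100_000):
    rows, secs = time_lowercase(n)
    print(f"{rows:>7} rows: {secs:.4f}s")
```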
Understanding how data cleaning steps scale helps you explain your approach clearly and shows you think about efficiency in real projects.
"What if we used a function that checks and replaces values only if they are inconsistent? How would the time complexity change?"