replace() for value substitution in Pandas - Time & Space Complexity
We want to understand how the time needed to replace values in a pandas DataFrame changes as the data grows.
How does the replace() method scale when changing many values?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'A': ['cat', 'dog', 'bird', 'cat', 'dog'],
    'B': ['red', 'blue', 'red', 'green', 'blue']
})

# Replace 'cat' with 'lion' and 'blue' with 'cyan'
df_replaced = df.replace({'cat': 'lion', 'blue': 'cyan'})
```
This code replaces specific values in the DataFrame with new ones.
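To make the substitution concrete, here is the same snippet with the result printed out (a minimal sketch mirroring the code above):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['cat', 'dog', 'bird', 'cat', 'dog'],
    'B': ['red', 'blue', 'red', 'green', 'blue']
})

# A flat dict maps old values to new values across every column
df_replaced = df.replace({'cat': 'lion', 'blue': 'cyan'})

print(df_replaced['A'].tolist())  # every 'cat' became 'lion'
print(df_replaced['B'].tolist())  # every 'blue' became 'cyan'
```

Note that the original `df` is unchanged; `replace()` returns a new DataFrame.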
First, identify the repeated work: any loops, recursion, or array traversals.
- Primary operation: pandas scans each cell in the DataFrame to check if it matches any value to replace.
- How many times: Once for each cell, so total cells = number of rows x number of columns.
As the DataFrame grows, the number of cells to check grows too.
| Input Size (rows x columns) | Approx. Operations |
|---|---|
| 10 x 2 = 20 | About 20 checks |
| 100 x 2 = 200 | About 200 checks |
| 1000 x 2 = 2000 | About 2000 checks |
Pattern observation: The operations grow roughly in direct proportion to the number of cells.
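The cell counts in the table can be reproduced directly: `df.size` is rows × columns, which is exactly the number of cells `replace()` must examine. A quick sketch (the column values here are arbitrary placeholders):

```python
import pandas as pd
import numpy as np

# Build DataFrames with growing row counts and count their cells.
for n_rows in (10, 100, 1000):
    df = pd.DataFrame({
        'A': np.random.choice(['cat', 'dog', 'bird'], size=n_rows),
        'B': np.random.choice(['red', 'blue', 'green'], size=n_rows),
    })
    # df.size = rows x columns = cells replace() has to check
    print(n_rows, 'rows ->', df.size, 'cells')
```

Doubling the rows (with columns fixed) doubles `df.size`, matching the linear pattern in the table.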
Time Complexity: O(n × m), where n is the number of rows and m is the number of columns.
This means the time to replace values grows linearly with the total number of cells in the DataFrame.
[X] Wrong: "replace() only checks the columns where replacements are specified, so it runs faster than scanning the whole DataFrame."
[OK] Correct: pandas replace() checks every cell because it does not know where the values appear; it must scan all data to find matches.
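One way to convince yourself of the full scan: place a target value in a column where you "didn't expect" it, and note that a flat replacement dict still finds it (a minimal sketch):

```python
import pandas as pd

# 'cat' appears in both columns. A flat dict like {'cat': 'lion'}
# is not restricted to column 'A', so both occurrences are replaced.
df = pd.DataFrame({
    'A': ['cat', 'dog'],
    'B': ['cat', 'blue']
})
out = df.replace({'cat': 'lion'})
print(out)
```

Both cells containing `'cat'` become `'lion'`, confirming that every cell is checked.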
Understanding how replace() scales helps you explain data cleaning steps clearly and shows you know how data size affects performance.
What if we replaced values only in one column instead of the whole DataFrame? How would the time complexity change?
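As a starting point for that exercise: `replace()` can also be called on a single Series, which only scans that column's n cells, so the work drops from O(n × m) to O(n). A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['cat', 'dog', 'bird', 'cat', 'dog'],
    'B': ['red', 'blue', 'red', 'green', 'blue']
})

# Restrict the scan to column 'A': only that column's cells are checked
df['A'] = df['A'].replace({'cat': 'lion'})
print(df['A'].tolist())
```

The nested-dict form `df.replace({'A': {'cat': 'lion'}})` achieves the same column targeting without selecting the Series explicitly.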