str.replace() for substitution in Pandas - Time & Space Complexity
We want to understand how the time it takes to replace text in a pandas column changes as the data grows.
How does the work grow when we replace strings in many rows?
Analyze the time complexity of the following code snippet.
import pandas as pd
df = pd.DataFrame({
    'text': ['apple pie', 'banana split', 'apple tart', 'banana bread'] * 1000
})
# Replace 'apple' with 'orange' in the 'text' column
result = df['text'].str.replace('apple', 'orange', regex=False)
This code replaces the word 'apple' with 'orange' in every string of the 'text' column.
- Primary operation: Checking and replacing the substring in each string of the column.
- How many times: Once for each row in the DataFrame (n times).
As the number of rows grows, the total work grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 string checks and replacements |
| 100 | About 100 string checks and replacements |
| 1000 | About 1000 string checks and replacements |
Pattern observation: Doubling the rows roughly doubles the work.
Time Complexity: O(n * m)
Here n is the number of rows and m is the average string length. Each of the n strings must be scanned for the pattern, and because Python strings are immutable, every match produces a newly allocated string, so the total work grows with the total number of characters processed.
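A quick way to see the per-row nature of the operation is to run the same replacement at two different sizes and confirm that every row is visited. This sketch wraps the snippet above in a helper (`replace_apple` is a name introduced here for illustration):

```python
import pandas as pd

def replace_apple(n_repeats):
    # Build a DataFrame with 4 * n_repeats rows, matching the snippet above.
    df = pd.DataFrame({
        'text': ['apple pie', 'banana split', 'apple tart', 'banana bread'] * n_repeats
    })
    # One substring scan-and-replace per row: O(n * m) total.
    return df['text'].str.replace('apple', 'orange', regex=False)

small = replace_apple(10)    # 40 rows
large = replace_apple(1000)  # 4000 rows

# The number of rewritten rows grows in direct proportion to n.
print((small == 'orange pie').sum())   # 10
print((large == 'orange pie').sum())   # 1000
```

Timing these two calls with `%timeit` would show roughly a 100x gap, consistent with linear growth in the row count.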
[X] Wrong: "Replacing text in a column is always very fast and does not depend on data size."
[OK] Correct: The operation must look at each string, so more rows or longer strings mean more work and more time.
Understanding how string operations scale helps you write efficient data cleaning code and explain your choices clearly in interviews.
"What if we replaced a regex pattern instead of a fixed string? How would the time complexity change?"