Regex operations in Pandas - Time & Space Complexity
When we use regex operations in pandas, we want to know how the running time changes as the data grows. The question to ask is: how much does searching or matching patterns slow down as we add more rows?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'text': ['apple123', 'banana456', 'cherry789', 'date012'] * 1000
})
# True for every row whose string contains three consecutive digits
matches = df['text'].str.contains(r'\d{3}')
```
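Before analyzing the cost, it helps to confirm what this call actually produces. A small check (rebuilding the same DataFrame) shows that `str.contains` returns one boolean per row, which is why the work is naturally "one regex check per row":

```python
import pandas as pd

# Same DataFrame as above: 4 sample strings repeated 1000 times = 4000 rows.
df = pd.DataFrame({
    'text': ['apple123', 'banana456', 'cherry789', 'date012'] * 1000
})
matches = df['text'].str.contains(r'\d{3}')

# str.contains returns a boolean Series with exactly one entry per row.
print(matches.dtype)   # bool
print(len(matches))    # 4000
print(matches.all())   # True: every sample string ends in three digits
```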
This code checks each string in the 'text' column to see if it contains three digits in a row.
- Primary operation: Applying the regex pattern to each string in the column.
- How many times: Once for every row in the DataFrame.
As the number of rows grows, the total work grows in direct proportion: each added row adds exactly one more regex check.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 regex checks |
| 100 | About 100 regex checks |
| 1000 | About 1000 regex checks |
Pattern observation: Doubling the rows roughly doubles the work because each row is checked once.
Time Complexity: O(n)
This means the time grows linearly with the number of rows n; more rows mean proportionally more work, assuming each per-string check costs roughly the same.
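The "one check per row" behavior from the table can be demonstrated directly by counting calls. The sketch below uses a hypothetical `count_checks` helper that applies a compiled pattern through `Series.apply` with a counter, rather than `str.contains`, purely so the calls are observable:

```python
import pandas as pd
import re

pattern = re.compile(r'\d{3}')

def count_checks(n):
    """Run one regex check per element of an n-row column and count the checks."""
    s = pd.Series(['apple123', 'nope'] * (n // 2))
    calls = 0

    def check(text):
        nonlocal calls
        calls += 1              # one increment per string scanned
        return bool(pattern.search(text))

    s.apply(check)
    return calls

# The number of checks tracks n exactly, matching the table above.
print(count_checks(10))    # 10
print(count_checks(100))   # 100
print(count_checks(1000))  # 1000
```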
[X] Wrong: "Regex operations take constant time no matter how many rows there are."
[OK] Correct: Each row is checked separately, so more rows mean more checks and more time.
Understanding how regex operations scale helps you explain your code's speed and handle bigger data confidently.
"What if we changed the regex to a more complex pattern that takes longer to match? How would the time complexity change?"
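One way to think about that question: the row count n still sets how many checks run, so the complexity stays linear in rows, but each check's cost (call it m) grows with pattern and string complexity, giving roughly O(n * m) overall. A small sketch, using an illustrative more elaborate pattern chosen for this example:

```python
import pandas as pd

df = pd.DataFrame({'text': ['apple123', 'banana456'] * 500})

# Simple pattern: any three consecutive digits.
simple = df['text'].str.contains(r'\d{3}')

# More elaborate pattern: whole string must be letters followed by 3+ digits.
# Still exactly one check per row, but each check can do more work per string.
elaborate = df['text'].str.contains(r'^[a-z]+\d{3,}$')

print(int(simple.sum()), int(elaborate.sum()))  # both match all 1000 rows
```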