Extracting with str.extract (regex) in Python Data Analysis - Time & Space Complexity
We want to understand how the time needed to extract text using regex grows as the data size increases.
How does the extraction time change when we have more rows to process?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

# 3,000 strings, each with letters followed by digits
data = pd.Series(['abc123', 'def456', 'ghi789'] * 1000)
pattern = r'(\d+)'  # capture group: one or more digits
extracted = data.str.extract(pattern)
```
This code extracts the numeric part of each string in the pandas Series using a regex capture group; the result is a DataFrame with one column per capture group.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Applying the regex extraction on each string in the Series.
- How many times: Once per element in the Series, so as many times as the number of rows.
As the number of rows grows, the total work grows roughly in direct proportion.
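Conceptually, the vectorized call does the same work as an explicit per-row loop. A minimal sketch using Python's `re` module (note that `str.extract` additionally handles NaN values and multiple capture groups for you):

```python
import re

import pandas as pd

data = pd.Series(['abc123', 'def456', 'ghi789'] * 1000)
pattern = re.compile(r'(\d+)')

# One regex search per row: n rows -> n searches, hence O(n) total work.
matches = [pattern.search(s) for s in data]
extracted = pd.Series([m.group(1) if m else None for m in matches])

print(extracted.head(3).tolist())  # ['123', '456', '789']
```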
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 regex extractions |
| 100 | 100 regex extractions |
| 1000 | 1000 regex extractions |
Pattern observation: Doubling the input roughly doubles the work because each string is processed once.
Time Complexity: O(n)
This means the extraction time grows linearly with the number of strings you process.
Space Complexity: O(n) as well, since the result holds one extracted value per input row.
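A quick timing check makes the linear trend visible. This is an illustrative sketch; absolute numbers depend on your machine, and at very small sizes fixed overhead can mask the trend:

```python
import time

import pandas as pd

pattern = r'(\d+)'
for n in [1_000, 2_000, 4_000]:
    data = pd.Series(['abc123', 'def456'] * (n // 2))
    start = time.perf_counter()
    extracted = data.str.extract(pattern)
    elapsed = time.perf_counter() - start
    print(f"n={n}: {elapsed:.4f}s")  # elapsed roughly doubles as n doubles
```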
[X] Wrong: "Using regex extraction is instant no matter how many rows there are."
[OK] Correct: Each row requires running the regex, so more rows mean more work and more time.
Understanding how regex extraction scales helps you explain performance when working with text data in real projects.
"What if the regex pattern was more complex and slower to match? How would that affect the time complexity?"