str.contains() for pattern matching in Pandas - Time & Space Complexity
We want to understand how the time needed to find patterns in text grows as the data gets bigger.
How does using str.contains() on many text entries affect the time it takes?
Analyze the time complexity of the following code snippet.
import pandas as pd
data = pd.Series(["apple", "banana", "apricot", "cherry", "pineapple"] * 1000)
pattern = "app"
matches = data.str.contains(pattern)
This code checks each string in the series to see if it contains the pattern "app".
- Primary operation: Checking each string for the pattern using
str.contains(). - How many times: Once for each string in the series (n times).
Each string is checked one by one. If there are more strings, the work grows in a straight line.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 pattern checks |
| 100 | 100 pattern checks |
| 1000 | 1000 pattern checks |
Pattern observation: Doubling the number of strings roughly doubles the work.
Time Complexity: O(n * m)
This means the time grows with the number of strings (n) and the average length of each string (m).
[X] Wrong: "The time only depends on the number of strings, not their length."
[OK] Correct: Each string's length affects how long the pattern search takes, so longer strings mean more work per check.
Understanding how pattern matching scales helps you explain and improve text search tasks in real projects.
"What if we used a compiled regular expression with str.contains()? How would the time complexity change?"