Text Cleaning Pipeline in Python for Data Analysis - Time & Space Complexity
We want to understand how the time needed to clean text grows as the amount of text grows.
How does the cleaning process scale when we have more words or sentences?
Analyze the time complexity of the following code snippet.
```python
import re

def clean_text(texts):
    cleaned = []
    for text in texts:
        text = text.lower()                   # normalize case
        text = re.sub(r'[^a-z ]', '', text)   # drop anything that is not a letter or space
        words = text.split()                  # split on whitespace into words
        cleaned.append(words)
    return cleaned
```
This code cleans a list of text strings by making them lowercase, removing non-letter characters, and splitting into words.
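As a quick illustration, here is what the function produces on a couple of made-up strings (the sample inputs are hypothetical):

```python
texts = ["Hello, World!", "Data cleaning: step 1."]
print(clean_text(texts))
# [['hello', 'world'], ['data', 'cleaning', 'step']]
```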
Identify the repeated work: loops, recursion, and array traversals.
- Primary operation: Looping over each text string; within each string, the lowercase conversion, regex substitution, and split all do work proportional to its length.
- How many times: Once per text string, with character-level work inside each one.
As the number of texts and their lengths grow, the cleaning time grows roughly in proportion to the total number of characters processed.
| Input Size (n texts, similar average length) | Approx. Operations (proportional to total characters) |
|---|---|
| 10 | Baseline: quick cleaning |
| 100 | About 10 times more work |
| 1,000 | About 100 times more work |
Pattern observation: The time grows roughly linearly with the total amount of text to clean.
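A minimal timing sketch makes the pattern concrete (it assumes clean_text from the snippet above is defined; the sample sentence and sizes are arbitrary). Each tenfold increase in input should take roughly ten times longer:

```python
import time

sample = "The quick brown fox, jumps over 12 lazy dogs!"

for n in (1_000, 10_000, 100_000):
    texts = [sample] * n
    start = time.perf_counter()
    clean_text(texts)  # assumes clean_text from the snippet above
    print(f"n={n:>7}: {time.perf_counter() - start:.4f} s")
```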
Time Complexity: O(n * m)
Here n is the number of texts and m is the average length of each text, so the total work is proportional to n * m, that is, to the total number of characters processed.
[X] Wrong: "The cleaning time depends only on the number of texts, not their length."
[OK] Correct: Each text's length affects how many characters are processed, so longer texts take more time even if the count is the same.
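Because the work tracks total characters, a quick way to compare two workloads is to compare their character counts rather than their text counts (a sketch with illustrative strings and sizes):

```python
short_texts = ["Hello, World!"] * 10_000          # same count, short strings
long_texts = ["Hello, World! " * 100] * 10_000    # same count, ~100x the characters

for label, texts in [("short", short_texts), ("long", long_texts)]:
    total_chars = sum(len(t) for t in texts)
    print(f"{label}: {len(texts):,} texts, {total_chars:,} characters to process")
```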
Understanding how text cleaning scales helps you explain your approach clearly and shows you think about efficiency in real data tasks.
"What if we added a nested loop to check each word against a list of stopwords? How would the time complexity change?"