
Text cleaning pipeline in Data Analysis Python - Time & Space Complexity

Time Complexity: Text cleaning pipeline
O(n * m)
Understanding Time Complexity

We want to understand how the time needed to clean text grows as the amount of text grows.

How does the cleaning process scale when we have more words or sentences?

Scenario Under Consideration

Analyze the time complexity of the following code snippet.


import re

def clean_text(texts):
    cleaned = []
    for text in texts:                       # runs once per text: n iterations
        text = text.lower()                  # scans every character: O(m)
        text = re.sub(r'[^a-z ]', '', text)  # regex scans every character: O(m)
        words = text.split()                 # split scans every character: O(m)
        cleaned.append(words)                # amortized O(1) append
    return cleaned

This code cleans a list of text strings by making them lowercase, removing non-letter characters, and splitting into words.
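As a quick sanity check, here is the function applied to two example strings (the inputs are made up for illustration):

```python
import re

def clean_text(texts):
    cleaned = []
    for text in texts:
        text = text.lower()
        text = re.sub(r'[^a-z ]', '', text)
        words = text.split()
        cleaned.append(words)
    return cleaned

print(clean_text(["Hello, World!", "Data Analysis 101"]))
# → [['hello', 'world'], ['data', 'analysis']]
```

Note that punctuation and digits are removed entirely, so "101" disappears rather than becoming a word.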

Identify Repeating Operations

Identify the loops, recursion, or traversals that repeat work.

  • Primary operation: Looping over each text string; within each string, the lowercasing, regex substitution, and split each scan every character.
  • How many times: Once per text string (n iterations), and inside each string, work proportional to that string's length (m characters).
How Execution Grows With Input

As the number of texts and their length grow, the cleaning time grows roughly in proportion to total characters.

Input Size (n texts)     Approx. Operations (proportional to total characters)
10 (short texts)         Low, quick cleaning
100 (medium texts)       About 10 times more work
1000 (long texts)        About 100 times more work

Pattern observation: The time grows roughly linearly with the total amount of text to clean.
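One way to see this linear pattern without a profiler is to count the characters each pass touches. A small sketch (the helper `cleaning_work` is introduced here for illustration, not part of the original code):

```python
def cleaning_work(texts):
    # Each text is scanned a constant number of times (lower, sub, split),
    # so total work is proportional to the total character count.
    return sum(len(text) for text in texts)

short = ["abc def"] * 10          # 10 short texts, 7 chars each
longer = ["abc def" * 10] * 100   # 100 texts, each 10x longer

print(cleaning_work(short))   # 70 characters of work
print(cleaning_work(longer))  # 7000 characters: about 100x more
```

Both the number of texts and their lengths grew by 10x, so the total work grew by roughly 100x, matching the table above.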

Final Time Complexity

Time Complexity: O(n * m)

This means the cleaning time grows in proportion to n × m: the number of texts (n) times the average length of each text (m), i.e., the total number of characters processed.

Common Mistake

[X] Wrong: "The cleaning time depends only on the number of texts, not their length."

[OK] Correct: Each text's length affects how many characters are processed, so longer texts take more time even if the count is the same.

Interview Connect

Understanding how text cleaning scales helps you explain your approach clearly and shows you think about efficiency in real data tasks.

Self-Check

"What if we added a nested loop to check each word against a list of stopwords? How would the time complexity change?"
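To reason about the self-check: scanning a stopword list of length k for every word adds a factor of k, giving O(n * m * k), while a set lookup is O(1) on average and keeps the total at O(n * m). A sketch under that assumption (the stopword list here is illustrative):

```python
import re

STOPWORDS = ["the", "a", "and", "is"]  # k stopwords (illustrative)
STOPWORD_SET = set(STOPWORDS)          # set membership is O(1) on average

def clean_text_list_check(texts):
    # `w not in STOPWORDS` scans the whole list per word: a hidden
    # nested loop, making this O(n * m * k).
    result = []
    for text in texts:
        words = re.sub(r'[^a-z ]', '', text.lower()).split()
        result.append([w for w in words if w not in STOPWORDS])
    return result

def clean_text_set_check(texts):
    # Set lookup avoids the extra factor of k, staying at O(n * m).
    result = []
    for text in texts:
        words = re.sub(r'[^a-z ]', '', text.lower()).split()
        result.append([w for w in words if w not in STOPWORD_SET])
    return result

print(clean_text_set_check(["The cat and the dog"]))
# → [['cat', 'dog']]
```

Both versions return the same result; only the cost of the membership check differs, which is exactly the kind of trade-off the self-check question is probing.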