Tokenization Basics in Python for Data Analysis - Time & Space Complexity
We want to understand how the time needed to split text into words grows as the text gets longer.
How does the work change when we have more words to split?
Analyze the time complexity of the following code snippet.
```python
text = "This is a simple sentence for tokenization."
tokens = text.split()
print(tokens)
```
This code splits a sentence into a list of words by spaces.
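A quick detail worth knowing: when `split()` is called with no separator argument, it splits on any run of whitespace (spaces, tabs, newlines), not just single spaces. A small sketch to illustrate:

```python
# With no separator argument, split() collapses consecutive whitespace
# and ignores leading/trailing whitespace entirely.
tokens = "  spaced    out\ttext ".split()
print(tokens)  # ['spaced', 'out', 'text']
```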
Identify the repeated work: loops, recursion, and traversals over the data.
- Primary operation: Scanning each character in the text to find spaces.
- How many times: Once for every character in the input text.
As the text gets longer, the number of characters to check grows directly with the text size.
| Input Size (n characters) | Approx. Operations |
|---|---|
| 10 | About 10 checks |
| 100 | About 100 checks |
| 1000 | About 1000 checks |
Pattern observation: The work grows in a straight line with the input size.
Time Complexity: O(n)
This means the time to split text grows directly with the number of characters in the text.
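You can see this linear pattern empirically by timing `split()` on increasingly long strings. The sketch below (absolute times will vary by machine; the 10x growth ratio is what matters) builds texts of different word counts and times each split:

```python
import timeit

base = "word "  # one sample token plus a trailing space
for n in (10_000, 100_000, 1_000_000):
    text = base * n  # a text containing n words
    elapsed = timeit.timeit(lambda: text.split(), number=10)
    print(f"{n:>9} words: {elapsed:.4f} s for 10 splits")
```

If the complexity is O(n), each row should take roughly 10 times longer than the one above it.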
[X] Wrong: "Splitting text into words takes the same time no matter how long the text is."
[OK] Correct: The code must look at each character to find spaces, so longer text means more work.
Understanding how tokenization time grows helps you explain how text processing scales in real projects.
"What if we split text using a more complex rule, like punctuation and spaces? How would the time complexity change?"