Why Lexical Analysis Tokenizes Source Code in Compiler Design: Performance Analysis
We want to understand how the work done by lexical analysis grows as the source code gets bigger.
Specifically, how does tokenizing many characters affect the time it takes?
Analyze the time complexity of the following lexical analysis process.
```javascript
function tokenize(source) {
  let tokens = []
  let i = 0
  while (i < source.length) {
    // Extract the next token starting at position i
    let token = extractNextToken(source, i)
    tokens.push(token)
    // Advance past the characters this token consumed
    i += token.length
  }
  return tokens
}
```
This code reads the source code string and breaks it into tokens one by one until done.
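The helper `extractNextToken` is not shown above. A minimal sketch of what it might look like, assuming a toy language with only identifiers, numbers, and single-character symbols (the real implementation depends on the language being tokenized):

```javascript
// Hypothetical sketch of extractNextToken for a toy language.
// Returns the token as a string, so token.length in tokenize() works.
function extractNextToken(source, start) {
  const ch = source[start]
  let end = start + 1
  if (/[A-Za-z_]/.test(ch)) {
    // Identifier: group consecutive letters, digits, and underscores
    while (end < source.length && /[A-Za-z0-9_]/.test(source[end])) end++
  } else if (/[0-9]/.test(ch)) {
    // Number: group consecutive digits
    while (end < source.length && /[0-9]/.test(source[end])) end++
  }
  // Anything else (symbols, whitespace) becomes a one-character token
  return source.slice(start, end)
}
```

The key property for the analysis is that each call consumes at least one character and never rescans characters it has already grouped.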
Identify the repeated operations: the loops, recursion, and array traversals.
- Primary operation: The while loop that moves through the source code character by character.
- How many times: It runs once per token, and together the tokens cover every character in the source exactly once.
As the source code length grows, the number of tokens grows roughly in proportion.
| Input Size (n characters) | Approx. Operations (tokens extracted) |
|---|---|
| 10 | About 10 tokens or fewer |
| 100 | About 100 tokens or fewer |
| 1000 | About 1000 tokens or fewer |
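The pattern in the table can be checked empirically. Below is a sketch using a simplified tokenizer (an assumption for illustration: each run of non-space characters or of spaces counts as one token) that counts how many times the loop body runs for inputs of increasing size:

```javascript
// Count loop iterations (tokens extracted) for a given input.
// countTokenizeSteps is a hypothetical helper for this demo.
function countTokenizeSteps(source) {
  const wordOrSpace = /\S+|\s+/y // sticky regex: matches starting exactly at lastIndex
  let steps = 0
  let i = 0
  while (i < source.length) {
    wordOrSpace.lastIndex = i
    const match = wordOrSpace.exec(source)
    i += match[0].length
    steps++
  }
  return steps
}

// Doubling or tenfold-ing the input size scales the step count the same way:
console.log(countTokenizeSteps("a b ".repeat(5)))   // small input
console.log(countTokenizeSteps("a b ".repeat(500))) // 100x larger input
```

Note the sticky (`y`) flag: it lets the regex match at position `i` without slicing the string, which keeps the demo itself linear-time.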
Pattern observation: The work grows roughly in direct proportion to the source size.
Time Complexity: O(n)
This means the time to tokenize grows linearly with the size of the source code.
[X] Wrong: "Tokenizing takes much longer than reading characters one by one because tokens are complex."
[OK] Correct: Each character is processed once, and tokens are formed by grouping characters, so the total work still grows linearly with input size.
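The "each character is processed once" claim can be made concrete by instrumenting a tokenizer to count character reads. The helper name and the letter/non-letter grouping rule below are assumptions for illustration:

```javascript
// Tokenize by grouping runs of letters vs. non-letters, counting
// one read per character. The read count should equal the input length.
function tokenizeCountingReads(source) {
  const isLetter = (c) => /[A-Za-z]/.test(c)
  let reads = 0
  const tokens = []
  let i = 0
  while (i < source.length) {
    const cls = isLetter(source[i])
    let end = i
    while (end < source.length && isLetter(source[end]) === cls) {
      reads++ // one read per character; no character is revisited
      end++
    }
    tokens.push(source.slice(i, end))
    i = end
  }
  return { tokens, reads }
}
```

For an input of n characters, `reads` comes out to exactly n, which is the essence of the O(n) bound: grouping characters into tokens adds no extra passes over the input.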
Understanding how lexical analysis scales helps you explain compiler steps clearly and shows you can reason about program efficiency.
"What if extractNextToken sometimes looks ahead multiple characters? How would that affect the time complexity?"