Analyzer components (tokenizer, filters) in Elasticsearch - Time & Space Complexity
When Elasticsearch analyzes text, it breaks it into parts using a tokenizer and a chain of token filters. Analysis runs both when documents are indexed and when queries are parsed, so understanding its cost tells us how fast indexing and search will be.
We want to see how the time to analyze text grows as the text gets longer.
Analyze the time complexity of the following analyzer configuration.
```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```
This analyzer first splits text into words with the standard tokenizer, then lowercases each token, and finally removes common stop words such as "the" and "and". The filters run in the order they are listed.
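The same pipeline can be modeled with a small standalone sketch. This is plain Python, not Elasticsearch code, and the stop-word list is a tiny illustrative sample:

```python
# A simplified model of the analyzer above: tokenize, lowercase, drop stop words.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to"}  # tiny sample list

def analyze(text: str) -> list[str]:
    tokens = text.split()                      # tokenizer: one pass over the text
    tokens = [t.lower() for t in tokens]       # lowercase filter: one pass per token
    return [t for t in tokens if t not in STOP_WORDS]  # stop filter: one more pass

print(analyze("The Quick Brown Fox and the Lazy Dog"))
# → ['quick', 'brown', 'fox', 'lazy', 'dog']
```

Each stage is a single linear pass over the tokens, which is what makes the overall analysis linear.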
Look at what repeats as the text grows.
- Primary operation: Tokenizing the text into words.
- How many times: Once per word in the input text.
- Additional operations: Each token passes through filters like lowercase and stop word removal, also once per token.
As the text gets longer, the number of words grows roughly in proportion.
| Input Size (words) | Approx. Operations |
|---|---|
| 10 | About 10 tokenizations + 20 filter passes |
| 100 | About 100 tokenizations + 200 filter passes |
| 1000 | About 1000 tokenizations + 2000 filter passes |
Pattern observation: The work grows directly with the number of words in the text.
Time Complexity: O(n)
This means the time to analyze text grows in a straight line with the number of words.
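The growth shown in the table can be reproduced with a tiny operation counter (a standalone sketch, not Elasticsearch internals):

```python
# Hypothetical operation counter: one tokenization per word, plus one pass
# per token for each filter (the analyzer above has two filters).
def count_operations(num_words: int, num_filters: int = 2) -> int:
    tokenizations = num_words
    filter_passes = num_words * num_filters
    return tokenizations + filter_passes

for n in (10, 100, 1000):
    print(n, count_operations(n))
# → 10 30
#   100 300
#   1000 3000
```

Multiplying the input by 10 multiplies the work by 10, the signature of O(n) growth.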
[X] Wrong: "Adding more filters will multiply the time by the number of filters squared."
[OK] Correct: Each filter makes one pass over the token stream, so with n tokens and f filters the total work is on the order of n × f. That is linear in the number of filters, not squared: doubling the filters roughly doubles the time.
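To see why filters add up rather than multiply quadratically, here is a sketch that applies each filter as its own pass and counts the work (the filter functions are illustrative, not Elasticsearch APIs):

```python
def run_filters(tokens: list[str], filters) -> tuple[list[str], int]:
    """Apply each filter as one independent pass over the token stream."""
    passes = 0
    for f in filters:
        tokens = [f(t) for t in tokens]  # one linear pass per filter
        passes += len(tokens)
    return tokens, passes

tokens = ["The", "Quick", "Fox"] * 100  # 300 tokens
_, two_filters = run_filters(tokens, [str.lower, str.strip])
_, four_filters = run_filters(tokens, [str.lower, str.strip, str.lower, str.strip])
print(two_filters, four_filters)
# → 600 1200  (doubling the filters doubles the work, it does not square it)
```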
Knowing how analyzers scale helps you explain performance in search systems. It shows you understand how text processing affects speed, a practical skill when tuning indexing and query pipelines.
What if we changed the tokenizer to a more complex one that splits text differently? How would the time complexity change?
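As a hint, consider a regex-based tokenizer (similar in spirit to Elasticsearch's pattern tokenizer; this sketch is plain Python, not the actual implementation). For simple patterns it still makes a single pass over the input, so the analysis typically remains O(n); only the constant cost per character changes:

```python
import re

# Illustrative regex tokenizer: one scan of the input string,
# emitting runs of letters and digits as tokens.
def pattern_tokenize(text: str) -> list[str]:
    return re.findall(r"[A-Za-z0-9]+", text)

print(pattern_tokenize("e-mail: jane.doe@example.com"))
# → ['e', 'mail', 'jane', 'doe', 'example', 'com']
```

Note that a poorly written regex with heavy backtracking can be much worse than linear, so the pattern itself matters.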