Elasticsearch query (~10 mins)

Standard analyzer in Elasticsearch - Step-by-Step Execution

Concept Flow - Standard analyzer
Input Text
Standard Tokenizer
Lowercase Filter
Stop Words Filter
Output Tokens
The standard analyzer processes text in stages: the standard tokenizer first splits the input into tokens, the lowercase filter then lowercases them, and a stop-words filter (when a stop-word list is configured) removes common stop words, leaving the cleaned tokens as output.
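The stages above can be sketched in a few lines of Python. This is a minimal approximation, not Elasticsearch's real implementation: real Elasticsearch delegates to Lucene's Unicode-aware tokenizer and a configurable stop-word list, while the regex tokenizer and the small stop-word set here are illustrative choices made to match this walkthrough.

```python
import re

# Illustrative stop-word subset chosen to mirror this walkthrough;
# Elasticsearch's actual list depends on the configured stopwords setting.
STOP_WORDS = {"the", "over", "and", "a", "an", "of", "to", "in"}

def analyze(text):
    tokens = re.findall(r"\w+", text)                  # tokenizer: split on non-word chars, dropping punctuation
    tokens = [token.lower() for token in tokens]       # lowercase filter
    return [t for t in tokens if t not in STOP_WORDS]  # stop-words filter

print(analyze("The Quick Brown Foxes jumped over the lazy dogs."))
# -> ['quick', 'brown', 'foxes', 'jumped', 'lazy', 'dogs']
```

Because lowercasing happens before stop-word removal, 'The' and 'the' are caught by the same stop-word entry.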
Execution Sample
Elasticsearch
GET /_analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown Foxes jumped over the lazy dogs."
}
This example analyzes the input text with the standard analyzer via the _analyze API to produce tokens. Note that the standard analyzer's stop-word filtering is disabled by default (its stopwords setting defaults to _none_), so the stop-word removal in the steps below assumes a stop-word list has been configured.
Execution Table
Step | Action | Input | Output
1 | Input text received | The Quick Brown Foxes jumped over the lazy dogs. | The Quick Brown Foxes jumped over the lazy dogs.
2 | Standard tokenizer splits text | The Quick Brown Foxes jumped over the lazy dogs. | ["The", "Quick", "Brown", "Foxes", "jumped", "over", "the", "lazy", "dogs"]
3 | Lowercase filter applied | ["The", "Quick", "Brown", "Foxes", "jumped", "over", "the", "lazy", "dogs"] | ["the", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dogs"]
4 | Stop words removed | ["the", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dogs"] | ["quick", "brown", "foxes", "jumped", "lazy", "dogs"]
💡 All stop words removed, final tokens ready for indexing or searching.
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 4
text | The Quick Brown Foxes jumped over the lazy dogs. | ["The", "Quick", "Brown", "Foxes", "jumped", "over", "the", "lazy", "dogs"] | ["the", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dogs"] | ["quick", "brown", "foxes", "jumped", "lazy", "dogs"]
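The variable tracker above can be reproduced step by step in Python. As before, the regex tokenizer and the two-word stop set are simplifications chosen to mirror this walkthrough, not Elasticsearch's actual internals.

```python
import re

text = "The Quick Brown Foxes jumped over the lazy dogs."

after_tokenizer = re.findall(r"\w+", text)              # after step 2: tokens, punctuation dropped
after_lowercase = [t.lower() for t in after_tokenizer]  # after step 3: all lowercase
stop_words = {"the", "over"}                            # illustrative subset matching this walkthrough
after_stop = [t for t in after_lowercase if t not in stop_words]  # after step 4

print(after_tokenizer)
print(after_lowercase)
print(after_stop)
```

Each printed list should match the corresponding column of the variable tracker.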
Key Moments - 3 Insights
Why does the analyzer convert all letters to lowercase?
Lowercasing ensures that searches are case-insensitive, so 'Quick' and 'quick' are treated the same, as shown in step 3 of the execution table.
What does the standard tokenizer do with punctuation?
The standard tokenizer splits text into words by removing punctuation and spaces, as seen in step 2 where the sentence is split into tokens without periods.
Why are some words like 'the' removed in the final output?
Common words called stop words are removed to reduce noise and improve search relevance, demonstrated in step 4 where 'the' is removed.
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the output after the lowercase filter (Step 3)?
A. ["the", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dogs"]
B. ["the", "quick", "brown", "foxes"]
C. "The Quick Brown Foxes jumped over the lazy dogs."
D. ["quick", "brown", "foxes", "jumped"]
💡 Hint
Check the Output column in Step 3 of the execution table.
At which step are stop words removed from the tokens?
A. Step 1
B. Step 2
C. Step 4
D. Step 3
💡 Hint
Look at the Action column and Output tokens in the execution table.
If the input text were 'Cats and dogs', which token would likely be removed by the stop words filter?
A. "and"
B. "cats"
C. "dogs"
D. "cats and dogs"
💡 Hint
Refer to the stop-words removal step in the execution table and variable tracker.
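The third quiz question can be checked directly. With 'Cats and dogs', only the conjunction "and" appears in a typical stop-word list, so it is the token that gets dropped; the stop set here is again an illustrative subset.

```python
stop_words = {"and", "the"}  # illustrative stop-word subset

# Tokenize (simple whitespace split is enough here), lowercase, then filter.
tokens = [t.lower() for t in "Cats and dogs".split()]
result = [t for t in tokens if t not in stop_words]
print(result)  # -> ['cats', 'dogs']
```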
Concept Snapshot
Standard analyzer:
- Splits text into tokens by words
- Lowercases tokens
- Removes common stop words (when a stop-word list is configured)
- Used for indexing and searching
- Helps match queries case-insensitively and ignore noise words
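One caveat worth knowing: the standard analyzer ships with its stop-word filter disabled (the stopwords setting defaults to _none_). To get the stop-word removal described above, configure it in the index settings; a sketch, where the index name my-index and the analyzer name my_standard are placeholders:

```
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
```

Here "_english_" selects the predefined English stop-word list; a custom array of words can be supplied instead.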
Full Transcript
The standard analyzer in Elasticsearch processes text by first splitting it into individual words with the standard tokenizer, which strips punctuation. It then converts all letters to lowercase to ensure case-insensitive matching. Finally, when a stop-word list is configured, it removes common stop words like 'the' and 'and' to reduce noise. The final output is a list of clean tokens ready for indexing or searching. This process helps Elasticsearch find relevant matches regardless of case while ignoring common words that add little meaning.