
Custom analyzers in Elasticsearch - Step-by-Step Execution

Concept Flow - Custom analyzers
Define tokenizer
Define filters
Create custom analyzer
Apply analyzer to text
Tokenize text
Apply filters to tokens
Output processed tokens
Custom analyzers combine a tokenizer and filters to process text into tokens for searching.
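The flow above can be sketched in plain Python. This is a rough approximation for illustration only: the real standard tokenizer uses Unicode word-boundary rules, and the real asciifolding filter has a much larger character-mapping table.

```python
import unicodedata

def standard_tokenize(text):
    # Rough stand-in for the standard tokenizer: split on whitespace.
    # (The real tokenizer also strips punctuation via Unicode rules.)
    return text.split()

def lowercase_filter(tokens):
    # Mirrors the lowercase token filter.
    return [t.lower() for t in tokens]

def asciifolding_filter(tokens):
    # Approximates asciifolding: NFD-decompose each token,
    # then drop the combining accent marks.
    return [
        "".join(c for c in unicodedata.normalize("NFD", t)
                if not unicodedata.combining(c))
        for t in tokens
    ]

def analyze(text):
    # Tokenizer first, then filters in order -- the same pipeline
    # the custom analyzer below defines.
    tokens = standard_tokenize(text)
    tokens = lowercase_filter(tokens)
    tokens = asciifolding_filter(tokens)
    return tokens

print(analyze("Café Déjà Vu"))  # ['cafe', 'deja', 'vu']
```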
Execution Sample
Elasticsearch
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
This request creates an index with a custom analyzer that splits text into tokens, lowercases them, and folds accented characters to their ASCII equivalents.
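Once the index exists, the analyzer can be tested with Elasticsearch's _analyze API, which runs a piece of text through the named analyzer and returns the resulting tokens:

```
GET /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Café Déjà Vu"
}
```

The response lists each token with its position and offsets, which is the easiest way to confirm an analyzer behaves as intended before indexing documents.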
Execution Table
Step | Action | Input Text / Tokens | Output Tokens
1 | Input text to analyze | "Café Déjà Vu" | "Café Déjà Vu"
2 | Tokenize with standard tokenizer | "Café Déjà Vu" | ["Café", "Déjà", "Vu"]
3 | Apply lowercase filter | ["Café", "Déjà", "Vu"] | ["café", "déjà", "vu"]
4 | Apply asciifolding filter | ["café", "déjà", "vu"] | ["cafe", "deja", "vu"]
5 | Output final tokens | ["cafe", "deja", "vu"] | ["cafe", "deja", "vu"]
💡 All filters applied; final tokens ready for indexing or searching.
Variable Tracker
Variable | Start | After Tokenize | After Lowercase | After Asciifolding | Final
tokens | N/A | ["Café", "Déjà", "Vu"] | ["café", "déjà", "vu"] | ["cafe", "deja", "vu"] | ["cafe", "deja", "vu"]
Key Moments - 3 Insights
Why do tokens change after applying the lowercase filter?
Because the lowercase filter converts every token to lowercase, as shown in step 3 of the execution table, where the tokens change from ["Café", "Déjà", "Vu"] to ["café", "déjà", "vu"].
What does the asciifolding filter do to tokens?
It removes accents by converting characters to their closest ASCII equivalents, as seen in step 4, where ["café", "déjà", "vu"] becomes ["cafe", "deja", "vu"].
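This folding behavior can be approximated in Python with Unicode normalization. Note this is a sketch, not the filter's exact mapping table, which covers far more characters:

```python
import unicodedata

def ascii_fold(token):
    # NFD splits 'é' into 'e' plus a combining acute accent;
    # dropping combining marks leaves the plain ASCII letter.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print([ascii_fold(t) for t in ["café", "déjà", "vu"]])  # ['cafe', 'deja', 'vu']
```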
Why is the tokenizer step important before filters?
Because the tokenizer splits the input text into tokens, which filters then process. Without tokenization (step 2), filters cannot work on individual words.
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what are the tokens immediately after tokenization (step 2)?
A. ["cafe", "deja", "vu"]
B. ["Café", "Déjà", "Vu"]
C. ["café", "déjà", "vu"]
D. ["Cafe", "Deja", "Vu"]
💡 Hint
Check the Output Tokens column at step 2 in the execution table.
At which step do tokens lose their accents?
A. Step 4 - Asciifolding filter
B. Step 3 - Lowercase filter
C. Step 2 - Tokenize
D. Step 5 - Output final tokens
💡 Hint
Look at the change from accented to unaccented tokens in the Output Tokens column.
If we remove the lowercase filter, what would be the tokens after asciifolding?
A. ["cafe", "deja", "vu"]
B. ["Café", "Déjà", "Vu"]
C. ["Cafe", "Deja", "Vu"]
D. ["café", "déjà", "vu"]
💡 Hint
Consider that asciifolding removes accents but does not change case; lowercase filter changes case.
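A quick check with Python's Unicode normalization (a hedged approximation of asciifolding, not the filter's full mapping) shows that accents disappear while case is preserved:

```python
import unicodedata

def ascii_fold(token):
    # Approximates asciifolding: strip combining accent marks only;
    # uppercase and lowercase letters are left as-is.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Without the lowercase filter, accents go but capitals stay.
print([ascii_fold(t) for t in ["Café", "Déjà", "Vu"]])  # ['Cafe', 'Deja', 'Vu']
```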
Concept Snapshot
Custom analyzers in Elasticsearch combine a tokenizer and filters.
Tokenizer splits text into tokens.
Filters modify tokens (e.g., lowercase, asciifolding).
Define in index settings under analysis.analyzer.
Used to control how text is indexed and searched.
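To put the analyzer to work at index and search time, reference it in a field mapping. The example below extends the earlier settings with a mapping for a hypothetical title field:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
```

With this mapping, a search for "cafe" matches a document whose title contains "Café", because both sides of the comparison pass through the same analysis pipeline.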
Full Transcript
Custom analyzers in Elasticsearch let you control how text is broken into tokens and processed. First, a tokenizer splits the text into words or tokens. Then, filters change these tokens, for example by making them lowercase or removing accents. In the example, the text "Café Déjà Vu" is tokenized into ["Café", "Déjà", "Vu"]. The lowercase filter changes these to ["café", "déjà", "vu"]. The asciifolding filter then removes accents, resulting in ["cafe", "deja", "vu"]. This process helps make searching more flexible and accurate. You define custom analyzers in the index settings under the analysis section, specifying the tokenizer and filters you want to use.