An analyzer breaks text into smaller pieces so a search engine can index and match it effectively. Tokenizers split text into words, and filters modify, remove, or clean up those words.
Analyzer components (tokenizer, filters) in Elasticsearch
Introduction
Analyzers are useful in situations such as:
Searching text while ignoring punctuation or special characters.
Making searches case-insensitive, so 'Apple' and 'apple' match.
Removing common words like 'the' and 'and' so searches focus on meaningful terms.
Breaking text into meaningful parts, such as words or numbers.
Reducing words to their base form, for example 'running' to 'run'.
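These effects can be sketched in plain Python (this is not Elasticsearch itself, just an illustration of the idea): applying the same normalization to both indexed text and queries is what makes 'Apple' and 'apple' match.

```python
import re

def normalize(text, stopwords=("the", "and")):
    """Split on non-alphanumeric characters, lowercase, and drop stop words."""
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    return [t for t in tokens if t not in stopwords]

# The same normalization runs on documents and on queries,
# so 'Apple' and 'apple' produce the same token.
print(normalize("Apple and the orange!"))  # ['apple', 'orange']
```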
Syntax
Elasticsearch
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["the", "and"]
        }
      }
    }
  }
}
The tokenizer splits text into tokens (words).
The filter array applies changes to tokens, like making them lowercase or removing stop words.
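The tokenizer-then-filters chain can be sketched in plain Python (a rough simulation, not how Elasticsearch works internally): the tokenizer runs once, then each filter transforms the token stream in order.

```python
import re

def standard_like_tokenizer(text):
    """Rough stand-in for the standard tokenizer: extract word characters."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "and"})):
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    """Mimic the custom analyzer: tokenize first, then apply each filter in order."""
    tokens = standard_like_tokenizer(text)
    for f in (lowercase_filter, stop_filter):
        tokens = f(tokens)
    return tokens

print(analyze("The quick AND the dead"))  # ['quick', 'dead']
```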
Examples
This analyzer splits text by spaces and makes all words lowercase.
Elasticsearch
{
  "tokenizer": "whitespace",
  "filter": ["lowercase"]
}
This analyzer uses the standard tokenizer and removes common stop words like 'the' and 'and'.
Elasticsearch
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"]
}
This analyzer treats the whole text as one token and makes it lowercase.
Elasticsearch
{
  "tokenizer": "keyword",
  "filter": ["lowercase"]
}
Sample Program
This example creates an index with a custom analyzer that splits text into words, makes them lowercase, and removes 'the' and 'and'. Then it analyzes a sentence to show the tokens.
Elasticsearch
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["the", "and"]
        }
      }
    }
  }
}
GET /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "The quick brown fox jumps over the lazy dog and runs away"
}
Output
The _analyze response lists the remaining tokens: quick, brown, fox, jumps, over, lazy, dog, runs, away. Every token is lowercased, and 'the' and 'and' are removed by the stop filter.
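The three example analyzers above differ mainly in their tokenizer. A plain-Python sketch (an approximation, not Elasticsearch's actual implementation) shows how the same input produces different token streams:

```python
import re

STOPWORDS = {"the", "and"}

def analyze(text, tokenizer, filters):
    """Run a tokenizer, then apply each filter to the token stream in order."""
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

whitespace = lambda text: text.split()            # like the whitespace tokenizer
standard = lambda text: re.findall(r"\w+", text)  # rough standard tokenizer
keyword = lambda text: [text]                     # whole input as one token

lowercase = lambda ts: [t.lower() for t in ts]
stop = lambda ts: [t for t in ts if t not in STOPWORDS]

text = "The quick fox and the dog"
print(analyze(text, whitespace, [lowercase]))      # ['the', 'quick', 'fox', 'and', 'the', 'dog']
print(analyze(text, standard, [lowercase, stop]))  # ['quick', 'fox', 'dog']
print(analyze(text, keyword, [lowercase]))         # ['the quick fox and the dog']
```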
Important Notes
Tokenizers break text into tokens, usually words or terms.
Filters can remove, change, or add tokens after tokenizing.
Stop filters remove common words that don't add meaning to searches.
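Filter order in the filter array also matters. With a case-sensitive stop list like the one in the examples, lowercasing must run before stop-word removal, or 'The' will slip past a stop list containing only 'the'. A plain-Python sketch of the difference:

```python
STOPWORDS = {"the", "and"}

def lowercase(tokens):
    return [t.lower() for t in tokens]

def stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["The", "Quick", "Fox"]

# lowercase first: "The" becomes "the" and matches the stop list
print(stop(lowercase(tokens)))  # ['quick', 'fox']

# stop first: "The" is not in the lowercase stop list, so it survives
print(lowercase(stop(tokens)))  # ['the', 'quick', 'fox']
```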
Summary
Analyzers use tokenizers and filters to prepare text for searching.
Tokenizers split text into smaller pieces called tokens.
Filters modify tokens to improve search quality, like lowercasing or removing stop words.