An analyzer breaks text into smaller pieces so a search engine can index and match it effectively. Tokenizers split text into words, and filters modify, remove, or clean up those words.
Analyzer components (tokenizer, filters) in Elasticsearch
Introduction
Analyzers are useful in situations such as:
Searching text while ignoring punctuation or special characters.
Making searches case-insensitive, so 'Apple' and 'apple' match.
Removing common words like 'the' and 'and' so searches focus on meaningful terms.
Breaking text into meaningful parts, such as words or numbers.
Reducing words to their base form, for example 'running' to 'run'.
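These effects can be sketched in plain Python (this is not Elasticsearch itself, just an illustration of the idea): applying the same normalization to both indexed text and queries is what makes 'Apple' and 'apple' match.

```python
import re

def normalize(text, stopwords=("the", "and")):
    """Split on non-alphanumeric characters, lowercase, and drop stop words."""
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    return [t for t in tokens if t not in stopwords]

# The same normalization runs on documents and on queries,
# so 'Apple' and 'apple' produce the same token.
print(normalize("Apple and the orange!"))  # ['apple', 'orange']
```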
Syntax
Elasticsearch
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["the", "and"]
        }
      }
    }
  }
}
The tokenizer splits text into tokens (words).
The filter array applies changes to tokens, like making them lowercase or removing stop words.
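The tokenizer-then-filters chain can be sketched in plain Python (a rough simulation, not how Elasticsearch works internally): the tokenizer runs once, then each filter transforms the token stream in order.

```python
import re

def standard_like_tokenizer(text):
    """Rough stand-in for the standard tokenizer: extract word characters."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "and"})):
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    """Mimic the custom analyzer: tokenize first, then apply each filter in order."""
    tokens = standard_like_tokenizer(text)
    for f in (lowercase_filter, stop_filter):
        tokens = f(tokens)
    return tokens

print(analyze("The quick AND the dead"))  # ['quick', 'dead']
```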
Examples
This analyzer splits text by spaces and makes all words lowercase.
Elasticsearch
{
  "tokenizer": "whitespace",
  "filter": ["lowercase"]
}
This analyzer uses the standard tokenizer and removes common stop words like 'the' and 'and'.
Elasticsearch
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"]
}
This analyzer treats the whole text as one token and makes it lowercase.
Elasticsearch
{
  "tokenizer": "keyword",
  "filter": ["lowercase"]
}
Sample Program
This example creates an index with a custom analyzer that splits text into words, makes them lowercase, and removes 'the' and 'and'. Then it analyzes a sentence to show the tokens.
Elasticsearch
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["the", "and"]
        }
      }
    }
  }
}
GET /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "The quick brown fox jumps over the lazy dog and runs away"
}
Output
The _analyze response lists the remaining tokens: quick, brown, fox, jumps, over, lazy, dog, runs, away. Every token is lowercased, and 'the' and 'and' are removed by the stop filter.
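The three example analyzers above differ mainly in their tokenizer. A plain-Python sketch (an approximation, not Elasticsearch's actual implementation) shows how the same input produces different token streams:

```python
import re

STOPWORDS = {"the", "and"}

def analyze(text, tokenizer, filters):
    """Run a tokenizer, then apply each filter to the token stream in order."""
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

whitespace = lambda text: text.split()            # like the whitespace tokenizer
standard = lambda text: re.findall(r"\w+", text)  # rough standard tokenizer
keyword = lambda text: [text]                     # whole input as one token

lowercase = lambda ts: [t.lower() for t in ts]
stop = lambda ts: [t for t in ts if t not in STOPWORDS]

text = "The quick fox and the dog"
print(analyze(text, whitespace, [lowercase]))      # ['the', 'quick', 'fox', 'and', 'the', 'dog']
print(analyze(text, standard, [lowercase, stop]))  # ['quick', 'fox', 'dog']
print(analyze(text, keyword, [lowercase]))         # ['the quick fox and the dog']
```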
Important Notes
Tokenizers break text into tokens, usually words or terms.
Filters can remove, change, or add tokens after tokenizing.
Stop filters remove common words that don't add meaning to searches.
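Filter order in the filter array also matters. With a case-sensitive stop list like the one in the examples, lowercasing must run before stop-word removal, or 'The' will slip past a stop list containing only 'the'. A plain-Python sketch of the difference:

```python
STOPWORDS = {"the", "and"}

def lowercase(tokens):
    return [t.lower() for t in tokens]

def stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["The", "Quick", "Fox"]

# lowercase first: "The" becomes "the" and matches the stop list
print(stop(lowercase(tokens)))  # ['quick', 'fox']

# stop first: "The" is not in the lowercase stop list, so it survives
print(lowercase(stop(tokens)))  # ['the', 'quick', 'fox']
```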
Summary
Analyzers use tokenizers and filters to prepare text for searching.
Tokenizers split text into smaller pieces called tokens.
Filters modify tokens to improve search quality, like lowercasing or removing stop words.