
Tokenizers (standard, whitespace, pattern) in Elasticsearch

Introduction

Tokenizers break text into smaller pieces called tokens. These tokens are what Elasticsearch stores in its index and matches against at search time.

Use the standard tokenizer when you want to split a sentence into words for general-purpose search.
Use the whitespace tokenizer when you want to split text only by spaces, keeping punctuation attached to words.
Use the pattern tokenizer when you want to split text using a custom pattern, such as commas or other special characters.
Syntax
Elasticsearch
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_standard_tokenizer": {
          "type": "standard"
        },
        "my_whitespace_tokenizer": {
          "type": "whitespace"
        },
        "my_pattern_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
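
Defining a tokenizer in the analysis settings does not by itself apply it to any field. A tokenizer is normally referenced from a custom analyzer, which a field then uses in the mappings. Here is a minimal sketch of that wiring (the analyzer name my_analyzer and the field title are hypothetical names chosen for illustration):

Elasticsearch
PUT /my_index2
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "comma_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "comma_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

With this in place, text indexed into the title field is split on commas.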

The standard tokenizer splits text on word boundaries (following the Unicode Text Segmentation algorithm) and removes most punctuation.

The whitespace tokenizer splits text only on whitespace, keeping punctuation as part of tokens.

The pattern tokenizer splits text using a Java regular expression you provide (the default is \W+, i.e. runs of non-word characters).

Examples
This creates a tokenizer that splits text into words, removing punctuation. Note that each example in this section recreates test_index, so if you run them in order, delete the index first with DELETE /test_index.
Elasticsearch
PUT /test_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_standard": {
          "type": "standard"
        }
      }
    }
  }
}
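You can check what the custom tokenizer produces with the index-level _analyze API (assuming the test_index from the snippet above exists):

Elasticsearch
POST /test_index/_analyze
{
  "tokenizer": "my_standard",
  "text": "Hello, world!"
}

This returns the tokens Hello and world, with the punctuation removed.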
This tokenizer splits text only by spaces, so punctuation stays with words.
Elasticsearch
PUT /test_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_whitespace": {
          "type": "whitespace"
        }
      }
    }
  }
}
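Testing this tokenizer with the same text shows the difference (assuming the test_index from the snippet above exists):

Elasticsearch
POST /test_index/_analyze
{
  "tokenizer": "my_whitespace",
  "text": "Hello, world!"
}

This returns the tokens "Hello," and "world!", with the punctuation kept.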
This tokenizer splits text wherever there is a comma.
Elasticsearch
PUT /test_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pattern": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
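Again, the _analyze API confirms the behavior (assuming the test_index from the snippet above exists):

Elasticsearch
POST /test_index/_analyze
{
  "tokenizer": "my_pattern",
  "text": "red,green,blue"
}

This returns the tokens red, green, and blue.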
Sample Program

This example shows how each tokenizer breaks the text differently.

Elasticsearch
POST /_analyze
{
  "tokenizer": "standard",
  "text": "Hello, world! Welcome to Elasticsearch."
}

POST /_analyze
{
  "tokenizer": "whitespace",
  "text": "Hello, world! Welcome to Elasticsearch."
}

POST /_analyze
{
  "tokenizer": {"type": "pattern", "pattern": ","},
  "text": "apple,banana,orange"
}
Output

standard: [Hello, world, Welcome, to, Elasticsearch]
whitespace: ["Hello,", "world!", "Welcome", "to", "Elasticsearch."]
pattern: [apple, banana, orange]
Important Notes

The standard tokenizer is the default and works well for most text.

The whitespace tokenizer is useful when punctuation should stay with words.

The pattern tokenizer is powerful for custom splitting but needs a correct regex pattern.
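
As an illustration of the regex caveat: the pattern is a Java regular expression, so splitting on a character that is special in regex, such as |, requires escaping. A sketch (pipe_index and pipe_tokenizer are hypothetical names):

Elasticsearch
PUT /pipe_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "pipe_tokenizer": {
          "type": "pattern",
          "pattern": "\\|"
        }
      }
    }
  }
}

An unescaped | would be interpreted as regex alternation rather than a literal pipe character.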

Summary

Tokenizers split text into tokens for searching.

standard splits on word boundaries and removes punctuation, whitespace splits only on spaces, and pattern splits on a regular expression you define.

Choose the tokenizer based on how you want to break your text.