
Tokenizers (standard, whitespace, pattern) in Elasticsearch

Introduction

Tokenizers break text into smaller pieces called tokens. These tokens are what Elasticsearch stores in its index and matches against at search time.

Use the standard tokenizer when you want to split a sentence into words for general-purpose search.
Use the whitespace tokenizer when you want to split text only by spaces, keeping punctuation attached to words.
Use the pattern tokenizer when you want to split text using a custom pattern, such as commas or other special characters.
Syntax
Elasticsearch
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_standard_tokenizer": {
          "type": "standard"
        },
        "my_whitespace_tokenizer": {
          "type": "whitespace"
        },
        "my_pattern_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
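
Defining a tokenizer in the analysis settings does not by itself apply it to any field. A tokenizer is normally referenced from a custom analyzer, which a field then uses in the mappings. Here is a minimal sketch of that wiring (the analyzer name my_analyzer and the field title are hypothetical names chosen for illustration):

Elasticsearch
PUT /my_index2
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "comma_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "comma_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

With this in place, text indexed into the title field is split on commas.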

The standard tokenizer splits text on word boundaries (following the Unicode Text Segmentation algorithm) and removes most punctuation.

The whitespace tokenizer splits text only on whitespace, keeping punctuation as part of tokens.

The pattern tokenizer splits text using a Java regular expression you provide (the default is \W+, i.e. runs of non-word characters).

Examples
This creates a tokenizer that splits text into words, removing punctuation. Note that each example in this section recreates test_index, so if you run them in order, delete the index first with DELETE /test_index.
Elasticsearch
PUT /test_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_standard": {
          "type": "standard"
        }
      }
    }
  }
}
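You can check what the custom tokenizer produces with the index-level _analyze API (assuming the test_index from the snippet above exists):

Elasticsearch
POST /test_index/_analyze
{
  "tokenizer": "my_standard",
  "text": "Hello, world!"
}

This returns the tokens Hello and world, with the punctuation removed.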
This tokenizer splits text only by spaces, so punctuation stays with words.
Elasticsearch
PUT /test_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_whitespace": {
          "type": "whitespace"
        }
      }
    }
  }
}
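Testing this tokenizer with the same text shows the difference (assuming the test_index from the snippet above exists):

Elasticsearch
POST /test_index/_analyze
{
  "tokenizer": "my_whitespace",
  "text": "Hello, world!"
}

This returns the tokens "Hello," and "world!", with the punctuation kept.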
This tokenizer splits text wherever there is a comma.
Elasticsearch
PUT /test_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pattern": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
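Again, the _analyze API confirms the behavior (assuming the test_index from the snippet above exists):

Elasticsearch
POST /test_index/_analyze
{
  "tokenizer": "my_pattern",
  "text": "red,green,blue"
}

This returns the tokens red, green, and blue.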
Sample Program

This example shows how each tokenizer breaks the text differently.

Elasticsearch
POST /_analyze
{
  "tokenizer": "standard",
  "text": "Hello, world! Welcome to Elasticsearch."
}

POST /_analyze
{
  "tokenizer": "whitespace",
  "text": "Hello, world! Welcome to Elasticsearch."
}

POST /_analyze
{
  "tokenizer": {"type": "pattern", "pattern": ","},
  "text": "apple,banana,orange"
}
Output

standard: [Hello, world, Welcome, to, Elasticsearch]
whitespace: ["Hello,", "world!", "Welcome", "to", "Elasticsearch."]
pattern: [apple, banana, orange]
Important Notes

The standard tokenizer is the default and works well for most text.

The whitespace tokenizer is useful when punctuation should stay with words.

The pattern tokenizer is powerful for custom splitting but needs a correct regex pattern.
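
As an illustration of the regex caveat: the pattern is a Java regular expression, so splitting on a character that is special in regex, such as |, requires escaping. A sketch (pipe_index and pipe_tokenizer are hypothetical names):

Elasticsearch
PUT /pipe_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "pipe_tokenizer": {
          "type": "pattern",
          "pattern": "\\|"
        }
      }
    }
  }
}

An unescaped | would be interpreted as regex alternation rather than a literal pipe character.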

Summary

Tokenizers split text into tokens for searching.

standard splits on word boundaries and removes punctuation, whitespace splits only on spaces, and pattern splits on a regular expression you define.

Choose the tokenizer based on how you want to break your text.