Elasticsearch query · ~10 mins

Tokenizers (standard, whitespace, pattern) in Elasticsearch - Step-by-Step Execution

Concept Flow - Tokenizers (standard, whitespace, pattern)
Input Text → Choose Tokenizer Type (Standard / Whitespace / Pattern) → Tokens Output → Used in Analysis
Text goes into a tokenizer, which splits it into tokens based on rules: standard splits on punctuation and spaces, whitespace splits only on spaces, and pattern splits wherever a regex matches.
Execution Sample
Elasticsearch
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Hello, world! This is Elasticsearch."
}
This example uses the whitespace tokenizer to split text into tokens separated by spaces.
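Without a running cluster, the whitespace rule can be approximated in plain Python (a sketch of the splitting rule, not Elasticsearch's actual implementation): `str.split()` with no arguments breaks on runs of whitespace and leaves punctuation attached to words.

```python
# Rough approximation of the whitespace tokenizer: split on runs of
# whitespace; punctuation stays attached to the neighboring word.
text = "Hello, world! This is Elasticsearch."
tokens = text.split()
print(tokens)  # ['Hello,', 'world!', 'This', 'is', 'Elasticsearch.']
```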
Execution Table
| Step | Input Text | Tokenizer Type | Tokenization Rule | Tokens Produced |
| --- | --- | --- | --- | --- |
| 1 | Hello, world! This is Elasticsearch. | whitespace | Split on spaces | ["Hello,", "world!", "This", "is", "Elasticsearch."] |
| 2 | Hello, world! This is Elasticsearch. | standard | Split on punctuation and spaces | ["Hello", "world", "This", "is", "Elasticsearch"] |
| 3 | Hello, world! This is Elasticsearch. | pattern | Split on regex \W+ (non-word chars) | ["Hello", "world", "This", "is", "Elasticsearch"] |
| 4 | N/A | N/A | N/A | End of tokenization |
💡 Tokenization ends after splitting input text into tokens based on chosen tokenizer rules.
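The three rows above can be reproduced approximately in Python with regular expressions. This is a sketch of each splitting rule, not the tokenizers' real (grammar-based) implementations; for this ASCII sample, though, the outputs match the table.

```python
import re

text = "Hello, world! This is Elasticsearch."

# Whitespace tokenizer: split on runs of whitespace only.
whitespace = text.split()

# Pattern tokenizer: Elasticsearch's default pattern is \W+
# (one or more non-word characters act as the separator).
pattern = [t for t in re.split(r"\W+", text) if t]

# Standard tokenizer: grammar-based word boundaries; for plain ASCII
# text like this, collecting \w+ runs gives the same result.
standard = re.findall(r"\w+", text)

print(whitespace)  # ['Hello,', 'world!', 'This', 'is', 'Elasticsearch.']
print(pattern)     # ['Hello', 'world', 'This', 'is', 'Elasticsearch']
print(standard)    # ['Hello', 'world', 'This', 'is', 'Elasticsearch']
```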
Variable Tracker
| Variable | Start | After Step 1 | After Step 2 | After Step 3 | Final |
| --- | --- | --- | --- | --- | --- |
| input_text | "Hello, world! This is Elasticsearch." | "Hello, world! This is Elasticsearch." | "Hello, world! This is Elasticsearch." | "Hello, world! This is Elasticsearch." | "Hello, world! This is Elasticsearch." |
| tokens_whitespace | [] | ["Hello,", "world!", "This", "is", "Elasticsearch."] | ["Hello,", "world!", "This", "is", "Elasticsearch."] | ["Hello,", "world!", "This", "is", "Elasticsearch."] | ["Hello,", "world!", "This", "is", "Elasticsearch."] |
| tokens_standard | [] | [] | ["Hello", "world", "This", "is", "Elasticsearch"] | ["Hello", "world", "This", "is", "Elasticsearch"] | ["Hello", "world", "This", "is", "Elasticsearch"] |
| tokens_pattern | [] | [] | [] | ["Hello", "world", "This", "is", "Elasticsearch"] | ["Hello", "world", "This", "is", "Elasticsearch"] |
Key Moments - 3 Insights
Why does the standard tokenizer produce tokens without punctuation while the whitespace tokenizer does not?
The standard tokenizer splits on punctuation and discards it as part of its process (see Step 2 in execution_table), while the whitespace tokenizer only splits by spaces and keeps punctuation attached to words (Step 1).
Why does the pattern tokenizer split differently than whitespace tokenizer?
Pattern tokenizer uses a regex to split on non-word characters (Step 3), so it removes punctuation as separators, unlike whitespace tokenizer which splits only on spaces (Step 1).
Can tokens include punctuation with whitespace tokenizer?
Yes, whitespace tokenizer keeps punctuation attached to words because it only splits on spaces (Step 1 tokens include commas and exclamation marks).
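The pattern tokenizer's regex is configurable, so separators other than the default \W+ can be chosen. As a hypothetical sketch (approximating the rule in Python, not Elasticsearch itself), here is what splitting a comma-separated list on commas with optional surrounding spaces would produce:

```python
import re

# Approximating a pattern tokenizer configured with the regex "\s*,\s*":
# commas (plus any surrounding spaces) act as separators, nothing else does.
text = "red,green , blue"
tokens = [t for t in re.split(r"\s*,\s*", text) if t]
print(tokens)  # ['red', 'green', 'blue']
```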
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what tokens does the standard tokenizer produce at Step 2?
A. ["Hello,", "world!", "This", "is", "Elasticsearch."]
B. ["Hello", "world", "This", "is", "Elasticsearch"]
C. ["hello", "world", "this", "is", "elasticsearch"]
D. ["Hello world This is Elasticsearch"]
💡 Hint
Check the Tokens Produced column at Step 2 in the execution table.
At which step does the tokenizer split text by regex pattern?
A. Step 1
B. Step 2
C. Step 3
D. Step 4
💡 Hint
Look for the pattern tokenizer and the regex \W+ in the execution table.
If the input text had no spaces, which tokenizer would produce the fewest tokens?
A. Whitespace tokenizer
B. Standard tokenizer
C. Pattern tokenizer
D. All produce the same tokens
💡 Hint
The whitespace tokenizer splits only on spaces, so an input with no spaces stays a single token (see the execution table for each splitting rule).
Concept Snapshot
Tokenizers split text into tokens for search analysis.
Standard tokenizer uses grammar-based word boundaries, splitting on punctuation and spaces and discarding the punctuation.
Whitespace tokenizer splits only on spaces, keeps punctuation.
Pattern tokenizer splits by regex pattern (e.g., non-word chars).
Choose tokenizer based on how you want to break text.
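One quick way to compare the choices is to feed the same spaceless input through each rule. A Python sketch (approximating the rules, not Elasticsearch itself):

```python
import re

# With no spaces in the input, a whitespace-style split yields one token,
# while a \W+-style pattern split still breaks on the punctuation.
text = "Hello,world!"
whitespace_tokens = text.split()
pattern_tokens = [t for t in re.split(r"\W+", text) if t]
print(whitespace_tokens)  # ['Hello,world!']
print(pattern_tokens)     # ['Hello', 'world']
```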
Full Transcript
Tokenizers in Elasticsearch break text into smaller pieces called tokens. The standard tokenizer splits text on punctuation and spaces. The whitespace tokenizer splits text only on spaces and keeps punctuation attached to words. The pattern tokenizer splits text using a regular expression, such as one matching non-word characters. For example, given the text 'Hello, world! This is Elasticsearch.', the whitespace tokenizer produces tokens that include punctuation, like 'Hello,' and 'world!'. The standard tokenizer produces tokens without punctuation, like 'Hello' and 'world'. The pattern tokenizer treats regex matches as separators and so also removes the punctuation. Understanding these differences helps you choose the right tokenizer for your search needs.