Elasticsearch query · ~10 mins

Tokenizers (standard, whitespace, pattern) in Elasticsearch - Step-by-Step Execution

Concept Flow - Tokenizers (standard, whitespace, pattern)
Input Text → Choose Tokenizer Type (Standard / Whitespace / Pattern) → Tokens Output → Used in Analysis
Text goes into a tokenizer, which splits it into tokens based on rules: standard splits on punctuation and spaces, whitespace splits only on spaces, and pattern splits wherever a regex matches.
Execution Sample
Elasticsearch
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Hello, world! This is Elasticsearch."
}
This example uses the whitespace tokenizer to split text into tokens separated by spaces.
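Without a running cluster, the whitespace rule can be approximated in plain Python (a sketch of the splitting rule, not Elasticsearch's actual implementation): `str.split()` with no arguments breaks on runs of whitespace and leaves punctuation attached to words.

```python
# Rough approximation of the whitespace tokenizer: split on runs of
# whitespace; punctuation stays attached to the neighboring word.
text = "Hello, world! This is Elasticsearch."
tokens = text.split()
print(tokens)  # ['Hello,', 'world!', 'This', 'is', 'Elasticsearch.']
```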
Execution Table
| Step | Input Text | Tokenizer Type | Tokenization Rule | Tokens Produced |
| --- | --- | --- | --- | --- |
| 1 | Hello, world! This is Elasticsearch. | whitespace | Split on spaces | ["Hello,", "world!", "This", "is", "Elasticsearch."] |
| 2 | Hello, world! This is Elasticsearch. | standard | Split on punctuation and spaces | ["Hello", "world", "This", "is", "Elasticsearch"] |
| 3 | Hello, world! This is Elasticsearch. | pattern | Split on regex \W+ (non-word chars) | ["Hello", "world", "This", "is", "Elasticsearch"] |
| 4 | N/A | N/A | N/A | End of tokenization |
💡 Tokenization ends after splitting input text into tokens based on chosen tokenizer rules.
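The three rows above can be reproduced approximately in Python with regular expressions. This is a sketch of each splitting rule, not the tokenizers' real (grammar-based) implementations; for this ASCII sample, though, the outputs match the table.

```python
import re

text = "Hello, world! This is Elasticsearch."

# Whitespace tokenizer: split on runs of whitespace only.
whitespace = text.split()

# Pattern tokenizer: Elasticsearch's default pattern is \W+
# (one or more non-word characters act as the separator).
pattern = [t for t in re.split(r"\W+", text) if t]

# Standard tokenizer: grammar-based word boundaries; for plain ASCII
# text like this, collecting \w+ runs gives the same result.
standard = re.findall(r"\w+", text)

print(whitespace)  # ['Hello,', 'world!', 'This', 'is', 'Elasticsearch.']
print(pattern)     # ['Hello', 'world', 'This', 'is', 'Elasticsearch']
print(standard)    # ['Hello', 'world', 'This', 'is', 'Elasticsearch']
```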
Variable Tracker
| Variable | Start | After Step 1 | After Step 2 | After Step 3 | Final |
| --- | --- | --- | --- | --- | --- |
| input_text | "Hello, world! This is Elasticsearch." | "Hello, world! This is Elasticsearch." | "Hello, world! This is Elasticsearch." | "Hello, world! This is Elasticsearch." | "Hello, world! This is Elasticsearch." |
| tokens_whitespace | [] | ["Hello,", "world!", "This", "is", "Elasticsearch."] | ["Hello,", "world!", "This", "is", "Elasticsearch."] | ["Hello,", "world!", "This", "is", "Elasticsearch."] | ["Hello,", "world!", "This", "is", "Elasticsearch."] |
| tokens_standard | [] | [] | ["Hello", "world", "This", "is", "Elasticsearch"] | ["Hello", "world", "This", "is", "Elasticsearch"] | ["Hello", "world", "This", "is", "Elasticsearch"] |
| tokens_pattern | [] | [] | [] | ["Hello", "world", "This", "is", "Elasticsearch"] | ["Hello", "world", "This", "is", "Elasticsearch"] |
Key Moments - 3 Insights
Why does the standard tokenizer produce tokens without punctuation while the whitespace tokenizer does not?
The standard tokenizer splits on punctuation and discards it as part of its process (see Step 2 in execution_table), while the whitespace tokenizer only splits by spaces and keeps punctuation attached to words (Step 1).
Why does the pattern tokenizer split differently than whitespace tokenizer?
Pattern tokenizer uses a regex to split on non-word characters (Step 3), so it removes punctuation as separators, unlike whitespace tokenizer which splits only on spaces (Step 1).
Can tokens include punctuation with whitespace tokenizer?
Yes, whitespace tokenizer keeps punctuation attached to words because it only splits on spaces (Step 1 tokens include commas and exclamation marks).
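The pattern tokenizer's regex is configurable, so separators other than the default \W+ can be chosen. As a hypothetical sketch (approximating the rule in Python, not Elasticsearch itself), here is what splitting a comma-separated list on commas with optional surrounding spaces would produce:

```python
import re

# Approximating a pattern tokenizer configured with the regex "\s*,\s*":
# commas (plus any surrounding spaces) act as separators, nothing else does.
text = "red,green , blue"
tokens = [t for t in re.split(r"\s*,\s*", text) if t]
print(tokens)  # ['red', 'green', 'blue']
```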
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what tokens does the standard tokenizer produce at Step 2?
A. ["Hello,", "world!", "This", "is", "Elasticsearch."]
B. ["Hello", "world", "This", "is", "Elasticsearch"]
C. ["hello", "world", "this", "is", "elasticsearch"]
D. ["Hello world This is Elasticsearch"]
💡 Hint
Check the Tokens Produced column at Step 2 in the execution table.
At which step does the tokenizer split text by regex pattern?
A. Step 1
B. Step 2
C. Step 3
D. Step 4
💡 Hint
Look for the pattern tokenizer and the regex \W+ in the execution table.
If the input text had no spaces, which tokenizer would produce the fewest tokens?
A. Whitespace tokenizer
B. Standard tokenizer
C. Pattern tokenizer
D. All produce the same tokens
💡 Hint
The whitespace tokenizer splits only on spaces, so an input with no spaces stays a single token (see the execution table for each splitting rule).
Concept Snapshot
Tokenizers split text into tokens for search analysis.
Standard tokenizer uses grammar-based word boundaries, splitting on punctuation and spaces and discarding the punctuation.
Whitespace tokenizer splits only on spaces, keeps punctuation.
Pattern tokenizer splits by regex pattern (e.g., non-word chars).
Choose tokenizer based on how you want to break text.
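One quick way to compare the choices is to feed the same spaceless input through each rule. A Python sketch (approximating the rules, not Elasticsearch itself):

```python
import re

# With no spaces in the input, a whitespace-style split yields one token,
# while a \W+-style pattern split still breaks on the punctuation.
text = "Hello,world!"
whitespace_tokens = text.split()
pattern_tokens = [t for t in re.split(r"\W+", text) if t]
print(whitespace_tokens)  # ['Hello,world!']
print(pattern_tokens)     # ['Hello', 'world']
```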
Full Transcript
Tokenizers in Elasticsearch break text into smaller pieces called tokens. The standard tokenizer splits text on punctuation and spaces. The whitespace tokenizer splits text only on spaces and keeps punctuation attached to words. The pattern tokenizer splits text using a regular expression, such as one matching non-word characters. For example, given the text 'Hello, world! This is Elasticsearch.', the whitespace tokenizer produces tokens that include punctuation, like 'Hello,' and 'world!'. The standard tokenizer produces tokens without punctuation, like 'Hello' and 'world'. The pattern tokenizer treats regex matches as separators and so also removes the punctuation. Understanding these differences helps you choose the right tokenizer for your search needs.