Challenge - 5 Problems
Tokenizer Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
intermediate · 2:00 remaining
Output of Standard Tokenizer on a Sample Text
Given the following Elasticsearch analyzer configuration using the standard tokenizer, what are the output tokens for the input text
"Hello, world! This is Elasticsearch."?
{
"analyzer": {
"my_standard_analyzer": {
"tokenizer": "standard"
}
}
}
Input text: "Hello, world! This is Elasticsearch."
Attempts:
2 left
💡 Hint
The standard tokenizer splits text into words, removing most punctuation, and preserves case.
✗ Incorrect
The standard tokenizer breaks text into words by removing most punctuation and preserves the original case of tokens.
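The behavior described above can be sketched outside Elasticsearch. The real standard tokenizer uses Unicode text segmentation (UAX #29), but for plain ASCII input like this a simple word-character regex produces the same tokens; the Python snippet below is an approximation, not the actual tokenizer:

```python
import re

text = "Hello, world! This is Elasticsearch."

# Approximate the standard tokenizer for ASCII text:
# keep runs of word characters, drop punctuation, preserve case.
tokens = re.findall(r"\w+", text)

print(tokens)
# ['Hello', 'world', 'This', 'is', 'Elasticsearch']
```

Note that the tokens keep their original casing; lowercasing only happens if a lowercase token filter is added to the analyzer.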
❓ Predict Output
intermediate · 2:00 remaining
Whitespace Tokenizer Output for a Given Text
What tokens does the whitespace tokenizer produce for the input text
"Quick brown fox jumps over the lazy dog."?
{
"analyzer": {
"my_whitespace_analyzer": {
"tokenizer": "whitespace"
}
}
}
Input text: "Quick brown fox jumps over the lazy dog."
Attempts:
2 left
💡 Hint
The whitespace tokenizer splits tokens only on spaces and does not lowercase or remove punctuation.
✗ Incorrect
The whitespace tokenizer splits text only on spaces, so punctuation remains attached to tokens and casing is preserved.
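For this input, the whitespace tokenizer behaves like Python's `str.split()`, which also splits only on runs of whitespace; the sketch below illustrates the expected tokens:

```python
text = "Quick brown fox jumps over the lazy dog."

# The whitespace tokenizer splits only on whitespace,
# so "dog." keeps its trailing period and casing is untouched.
tokens = text.split()

print(tokens)
# ['Quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
```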
❓ Predict Output
advanced · 2:00 remaining
Pattern Tokenizer with Custom Regex Output
Using the pattern tokenizer with the regex
"\\W+" (non-word characters as delimiters), what tokens are produced from the input text "Email me at user@example.com!"?
{
"analyzer": {
"my_pattern_analyzer": {
"filter": ["lowercase"],
"tokenizer": {
"type": "pattern",
"pattern": "\\W+"
}
}
}
}
Input text: "Email me at user@example.com!"
Attempts:
2 left
💡 Hint
The pattern tokenizer splits on non-word characters, and the lowercase filter converts tokens to lowercase.
✗ Incorrect
The pattern tokenizer splits on any sequence of non-word characters (like @, ., !), so 'user@example.com' splits into 'user', 'example', 'com'. The lowercase filter makes all tokens lowercase.
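Elasticsearch's pattern tokenizer uses Java regular expressions, but for the pattern "\\W+" Python's `re` module splits identically, so the result can be sketched like this (with `.lower()` standing in for the lowercase filter):

```python
import re

text = "Email me at user@example.com!"

# Split on runs of non-word characters, as the pattern tokenizer does.
# re.split leaves an empty string when the text ends with a delimiter,
# so empty tokens are filtered out; .lower() mimics the lowercase filter.
tokens = [t.lower() for t in re.split(r"\W+", text) if t]

print(tokens)
# ['email', 'me', 'at', 'user', 'example', 'com']
```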
❓ Predict Output
advanced · 2:00 remaining
Effect of Pattern Tokenizer with Complex Regex
What tokens result from using a pattern tokenizer with the regex
"[ ,.!]+" on the input text "Hello, world! Welcome to Elasticsearch."?
{
"analyzer": {
"complex_pattern_analyzer": {
"tokenizer": {
"type": "pattern",
"pattern": "[ ,.!]+"
}
}
}
}
Input text: "Hello, world! Welcome to Elasticsearch."
Attempts:
2 left
💡 Hint
The pattern tokenizer splits on any sequence of spaces, commas, periods, or exclamation marks.
✗ Incorrect
The regex splits tokens on spaces, commas, periods, and exclamation marks, so punctuation is removed and tokens are split cleanly.
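As before, Python's `re.split` handles this character-class pattern the same way as the Java regex engine the pattern tokenizer uses, so the tokens can be sketched as follows. Note that this analyzer has no lowercase filter, so casing is preserved:

```python
import re

text = "Hello, world! Welcome to Elasticsearch."

# Split on runs of spaces, commas, periods, or exclamation marks,
# filtering the empty string left by the trailing period.
tokens = [t for t in re.split(r"[ ,.!]+", text) if t]

print(tokens)
# ['Hello', 'world', 'Welcome', 'to', 'Elasticsearch']
```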
🧠 Conceptual
expert · 2:00 remaining
Choosing the Correct Tokenizer for Case-Sensitive Search
You want to create an Elasticsearch analyzer that preserves the original casing of tokens and splits text only on whitespace. Which tokenizer should you use to achieve this?
Attempts:
2 left
💡 Hint
Think about which tokenizer splits only on spaces and does not change case by default.
✗ Incorrect
The whitespace tokenizer splits tokens only on whitespace and does not lowercase tokens, preserving original casing. The standard tokenizer also preserves case, but it removes punctuation and splits on more than just whitespace (it is the standard *analyzer* that lowercases, via its lowercase token filter). A pattern tokenizer can be configured to split on whitespace, but the whitespace tokenizer does this out of the box with no extra configuration.
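The contrast can be sketched on a mixed-case input, again using `str.split()` for the whitespace tokenizer and a word-character regex as a rough stand-in for the standard tokenizer:

```python
import re

text = "Search QUICK Foxes!"

# Whitespace tokenizer: splits only on spaces, keeps case and punctuation.
whitespace_tokens = text.split()
print(whitespace_tokens)   # ['Search', 'QUICK', 'Foxes!']

# Standard-tokenizer approximation: strips punctuation but still keeps
# case; lowercasing would only come from a separate lowercase filter.
standard_like = re.findall(r"\w+", text)
print(standard_like)       # ['Search', 'QUICK', 'Foxes']
```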