Elasticsearch · ~20 mins

Tokenizers (standard, whitespace, pattern) in Elasticsearch - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output · intermediate
Output of Standard Tokenizer on a Sample Text
Given the following Elasticsearch analyzer configuration using the standard tokenizer, what are the output tokens for the input text "Hello, world! This is Elasticsearch."?
Elasticsearch
{
  "analyzer": {
    "my_standard_analyzer": {
      "tokenizer": "standard"
    }
  }
}

Input text: "Hello, world! This is Elasticsearch."
A) ["hello,", "world!", "this", "is", "elasticsearch."]
B) ["Hello,", "world!", "This", "is", "Elasticsearch."]
C) ["Hello", "world", "This", "is", "Elasticsearch"]
D) ["hello", "world", "this", "is", "elasticsearch"]
💡 Hint
The standard tokenizer splits text into words, removes most punctuation, and preserves case.
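If you have a cluster handy, you can check your prediction with the `_analyze` API. Sending the request body below to `POST /_analyze` (no index required) returns the tokens the standard tokenizer produces, along with their offsets and positions:

```json
{
  "tokenizer": "standard",
  "text": "Hello, world! This is Elasticsearch."
}
```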
Predict Output · intermediate
Whitespace Tokenizer Output for a Given Text
What tokens does the whitespace tokenizer produce for the input text "Quick brown fox jumps over the lazy dog."?
Elasticsearch
{
  "analyzer": {
    "my_whitespace_analyzer": {
      "tokenizer": "whitespace"
    }
  }
}

Input text: "Quick brown fox jumps over the lazy dog."
A) ["dog.", "lazy", "the", "over", "jumps", "fox", "brown", "Quick"]
B) ["Quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog."]
C) ["Quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
D) ["quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
💡 Hint
The whitespace tokenizer splits only on whitespace; it does not lowercase tokens or remove punctuation.
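The same `_analyze` endpoint works for the built-in whitespace tokenizer; a request body like the following, sent to `POST /_analyze`, shows the tokens directly:

```json
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox jumps over the lazy dog."
}
```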
Predict Output · advanced
Pattern Tokenizer with Custom Regex Output
Using the pattern tokenizer with the regex "\\W+" (non-word characters as delimiters), what tokens are produced from the input text "Email me at user@example.com!"?
Elasticsearch
{
  "analyzer": {
    "my_pattern_analyzer": {
      "type": "custom",
      "tokenizer": "my_pattern_tokenizer",
      "filter": ["lowercase"]
    }
  },
  "tokenizer": {
    "my_pattern_tokenizer": {
      "type": "pattern",
      "pattern": "\\W+"
    }
  }
}

Input text: "Email me at user@example.com!"
A) ["email", "me", "at", "user", "example", "com"]
B) ["Email", "me", "at", "user@example.com"]
C) ["email", "me", "at", "user@example.com"]
D) ["Email", "me", "at", "user", "example", "com"]
💡 Hint
The pattern tokenizer splits on runs of non-word characters, and the lowercase filter converts the resulting tokens to lowercase.
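The `_analyze` API also accepts an inline custom tokenizer definition, so you can try the pattern-plus-filter combination without creating an index (request body for `POST /_analyze`):

```json
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "\\W+"
  },
  "filter": ["lowercase"],
  "text": "Email me at user@example.com!"
}
```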
Predict Output · advanced
Effect of Pattern Tokenizer with Complex Regex
What tokens result from using a pattern tokenizer with the regex "[ ,.!]+" on the input text "Hello, world! Welcome to Elasticsearch."?
Elasticsearch
{
  "analyzer": {
    "complex_pattern_analyzer": {
      "type": "custom",
      "tokenizer": "complex_pattern_tokenizer"
    }
  },
  "tokenizer": {
    "complex_pattern_tokenizer": {
      "type": "pattern",
      "pattern": "[ ,.!]+"
    }
  }
}

Input text: "Hello, world! Welcome to Elasticsearch."
A) ["Hello", "world", "Welcome", "to", "Elasticsearch"]
B) ["Hello", "world!", "Welcome", "to", "Elasticsearch."]
C) ["Hello", "world", "Welcome", "to", "Elasticsearch."]
D) ["Hello,", "world", "Welcome", "to", "Elasticsearch"]
💡 Hint
The pattern tokenizer splits on any sequence of spaces, commas, periods, or exclamation marks.
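To experiment with this regex yourself, you can again send an inline tokenizer definition to `POST /_analyze`:

```json
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "[ ,.!]+"
  },
  "text": "Hello, world! Welcome to Elasticsearch."
}
```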
🧠 Conceptual · expert
Choosing the Correct Tokenizer for Case-Sensitive Search
You want to create an Elasticsearch analyzer that preserves the original casing of tokens and splits text only on whitespace. Which tokenizer should you use to achieve this?
A) Use the standard tokenizer with a lowercase filter.
B) Use the standard tokenizer without any filters.
C) Use the pattern tokenizer with pattern "\\s+" and a lowercase filter.
D) Use the whitespace tokenizer without any filters.
💡 Hint
Think about which tokenizer splits only on spaces and does not change case by default.
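Whichever option you pick, wiring a built-in tokenizer into a custom analyzer at index-creation time follows the same shape. Below is a minimal sketch of the request body for `PUT /my_index` — the index name and `my_custom_analyzer` are placeholders, and `standard` merely stands in for whichever tokenizer you chose:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
```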