Elasticsearch · query · ~30 mins

Tokenizers (standard, whitespace, pattern) in Elasticsearch - Mini Project: Build & Apply

Create and Test Tokenizers in Elasticsearch
📖 Scenario: You are setting up an Elasticsearch index for a small library catalog. You want to understand how different tokenizers break down text into searchable words.
🎯 Goal: Build an Elasticsearch index with three different tokenizers: standard, whitespace, and pattern. Then test each tokenizer with the same sample text to see how they split the text into tokens.
📋 What You'll Learn
Create an index called library with a custom analyzer using the standard tokenizer
Add a custom analyzer using the whitespace tokenizer
Add a custom analyzer using the pattern tokenizer with pattern \W+
Test each tokenizer by analyzing the text "Elasticsearch is great, isn't it?"
Print the tokens produced by each tokenizer
💡 Why This Matters
🌍 Real World
Tokenizers help break down text into searchable pieces in search engines like Elasticsearch, improving search accuracy.
💼 Career
Understanding tokenizers is important for roles in search engineering, data indexing, and backend development involving text search.
Step 1: Create the Elasticsearch index with the standard tokenizer analyzer
Write the JSON to create an Elasticsearch index called library with a custom analyzer named standard_analyzer that uses the standard tokenizer.
💡 Hint: Use the PUT method to create the index and define the standard_analyzer inside settings.analysis.analyzer.
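One way to write this request, in Kibana Dev Tools console syntax (a sketch; the index name and analyzer name come from the step above, and a local cluster is assumed):

```
PUT /library
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_analyzer": {
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
```

A custom analyzer needs `"type": "custom"` and at least a `tokenizer`; with no character or token filters listed, the output is exactly what the standard tokenizer produces.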

Step 2: Add whitespace and pattern tokenizer analyzers to the index settings
Add two more custom analyzers to the library index settings: whitespace_analyzer using the whitespace tokenizer, and pattern_analyzer using the pattern tokenizer with the pattern "\\W+".
💡 Hint: Define the pattern_tokenizer under analysis.tokenizer with type pattern and the given pattern, then use it in pattern_analyzer.
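Because analysis settings are static (they can only be defined when the index is created, or changed on a closed index), in practice all three analyzers and the custom tokenizer go into the same PUT request. A sketch of the combined settings:

```
PUT /library
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "pattern_tokenizer": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "analyzer": {
        "standard_analyzer":   { "type": "custom", "tokenizer": "standard" },
        "whitespace_analyzer": { "type": "custom", "tokenizer": "whitespace" },
        "pattern_analyzer":    { "type": "custom", "tokenizer": "pattern_tokenizer" }
      }
    }
  }
}
```

Note the doubled backslash: inside a JSON string, `\\W+` is the regex `\W+`, which splits on runs of non-word characters.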

Step 3: Analyze the sample text with each tokenizer
Write three separate _analyze requests to test the text "Elasticsearch is great, isn't it?" using the analyzers standard_analyzer, whitespace_analyzer, and pattern_analyzer.
💡 Hint: Use the _analyze API with the analyzer field set to each analyzer name and the same text.
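The three requests only differ in the `analyzer` field. A sketch, again in console syntax:

```
GET /library/_analyze
{
  "analyzer": "standard_analyzer",
  "text": "Elasticsearch is great, isn't it?"
}

GET /library/_analyze
{
  "analyzer": "whitespace_analyzer",
  "text": "Elasticsearch is great, isn't it?"
}

GET /library/_analyze
{
  "analyzer": "pattern_analyzer",
  "text": "Elasticsearch is great, isn't it?"
}
```

Each response contains a `tokens` array; the `token` field of each entry is the piece of text you will print in the next step.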

Step 4: Print the tokens from each analyzer's output
Print the tokens produced by each analyzer from the previous step in this exact format:
Standard tokens: [list]
Whitespace tokens: [list]
Pattern tokens: [list]
Replace [list] with the tokens as a Python list of strings.
💡 Hint: Use print statements with the exact token lists shown.
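A minimal Python sketch of this step. The standard-tokenizer list is hard-coded from the expected _analyze output (an assumption based on documented tokenizer behavior: a custom analyzer with only a tokenizer and no lowercase filter keeps the original casing, and the standard tokenizer keeps "isn't" as one token); the whitespace and pattern lists are derived in plain Python to mirror what those tokenizers do.

```python
import re

text = "Elasticsearch is great, isn't it?"

# Expected standard-tokenizer output (assumption: no lowercase filter,
# so casing is preserved and "isn't" stays a single token).
standard_tokens = ["Elasticsearch", "is", "great", "isn't", "it"]

# The whitespace tokenizer splits on whitespace only, so punctuation
# stays attached to the words ("great,", "it?").
whitespace_tokens = text.split()

# The pattern tokenizer with \W+ splits on runs of non-word characters,
# so the apostrophe in "isn't" becomes a split point. Python's \W is a
# close approximation of the Java regex class Elasticsearch uses.
pattern_tokens = [t for t in re.split(r"\W+", text) if t]

print("Standard tokens:", standard_tokens)
print("Whitespace tokens:", whitespace_tokens)
print("Pattern tokens:", pattern_tokens)
```

Notice how the same sentence yields five, five, and six tokens respectively: only the pattern tokenizer breaks "isn't" apart, and only the whitespace tokenizer keeps the punctuation.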