Elasticsearch · How-To · Beginner · 4 min read

How to Use edge_ngram for Autocomplete in Elasticsearch

Use the edge_ngram tokenizer in your Elasticsearch index mapping to break words into prefixes for autocomplete. Define a custom analyzer that applies edge_ngram at index time, and pair it with a standard search analyzer so user input matches those prefixes efficiently.
📐

Syntax

The edge_ngram tokenizer splits text into smaller parts starting from the beginning of the word, which helps autocomplete by matching prefixes.

Key parts:

  • tokenizer: Defines how text is split; edge_ngram creates prefixes.
  • min_gram: Minimum length of generated tokens.
  • max_gram: Maximum length of generated tokens.
  • analyzer: Combines tokenizer and filters for indexing and searching.
json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
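You can check which tokens the index analyzer emits with the _analyze API (run against any index created with the settings above; the index name autocomplete_example here is just an example):

json
POST /autocomplete_example/_analyze
{
  "analyzer": "autocomplete",
  "text": "Apple"
}

With min_gram 1 and max_gram 20, this returns the tokens a, ap, app, appl, and apple, which is exactly what a prefix query needs to match against.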
💻

Example

This example creates an index with edge_ngram for autocomplete on the name field. It then indexes sample documents and shows a search query that returns autocomplete suggestions.

json
PUT /autocomplete_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

POST /autocomplete_example/_doc/1
{
  "name": "Apple"
}

POST /autocomplete_example/_doc/2
{
  "name": "Application"
}

POST /autocomplete_example/_doc/3
{
  "name": "Banana"
}

POST /autocomplete_example/_refresh

GET /autocomplete_example/_search
{
  "query": {
    "match": {
      "name": "app"
    }
  }
}
Output
{
  "hits": {
    "total": { "value": 2, "relation": "eq" },
    "hits": [
      { "_source": { "name": "Apple" } },
      { "_source": { "name": "Application" } }
    ]
  }
}
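Because the search analyzer uses the standard tokenizer, multi-word input is split into separate terms and combined with OR by default, which can broaden the result set as the user types more words. Requiring all terms with operator set to and keeps suggestions narrow (query shown against the example index above):

json
GET /autocomplete_example/_search
{
  "query": {
    "match": {
      "name": {
        "query": "app ban",
        "operator": "and"
      }
    }
  }
}

Here no single name contains both terms, so nothing matches; with the default OR, all three documents would be returned.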
⚠️

Common Pitfalls

Common mistakes when using edge_ngram for autocomplete include:

  • Applying edge_ngram in the search analyzer: the query itself is then split into prefixes, so short inputs match far too many documents. Use edge_ngram only in the index analyzer.
  • Setting min_gram too high, so the first characters a user types never match anything.
  • Omitting the lowercase filter, causing case-sensitive mismatches.
  • Applying the edge_ngram tokenizer to fields that do not need prefix matching, which inflates the index unnecessarily.
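The first pitfall is easy to reproduce by forcing the edge_ngram analyzer onto the query side. Against the Example index, the following search overrides the search analyzer per query, so the unrelated term "apricot" is split into prefixes such as a and ap and wrongly matches Apple and Application:

json
GET /autocomplete_example/_search
{
  "query": {
    "match": {
      "name": {
        "query": "apricot",
        "analyzer": "autocomplete"
      }
    }
  }
}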

Correct usage: apply the edge_ngram analyzer only at index time and a standard analyzer at search time. The analysis settings are the same as in the Syntax section; the key part is the field mapping:

json
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
📊

Quick Reference

  • edge_ngram tokenizer: Generates prefixes for autocomplete.
  • min_gram: Smallest prefix length (usually 1).
  • max_gram: Longest prefix length (set it to cover the longest input you expect users to type).
  • index analyzer: Uses edge_ngram tokenizer for prefix indexing.
  • search analyzer: Uses standard tokenizer for normal search input.
  • lowercase filter: Ensures case-insensitive matching.

Key Takeaways

  • Use the edge_ngram tokenizer only in the index analyzer to generate prefixes for autocomplete.
  • Set min_gram to 1 and max_gram to a length that covers your longest expected prefixes.
  • Use a standard tokenizer with a lowercase filter as the search analyzer so user input is matched as typed.
  • Avoid applying edge_ngram in the search analyzer to prevent noisy results.
  • Always include the lowercase filter to keep autocomplete case-insensitive.