What is TF-IDF and BM25 scoring in Elasticsearch?

Introduction

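TF-IDF and BM25 are relevance scoring functions. For each document that matches a query, they compute a score from how often the query terms appear in the document and how rare those terms are across the index, so the best matches rank first.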

Typical situations where scoring matters:

When you want to rank search results by relevance in Elasticsearch.

When you need to find documents that best match a user's query.

When you want to improve search quality by scoring words based on importance.

When you want to compare different scoring methods in Elasticsearch.

When building a search engine that returns the most useful documents first.

Syntax

Elasticsearch

GET /your_index/_search
{
  "query": {
    "match": {
      "field_name": {
        "query": "search words",
        "operator": "and"
      }
    }
  },
  "explain": true
}

Elasticsearch has used BM25 as the default similarity since version 5.0.

Set "explain": true in the search request to see how the score of each hit was calculated.
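Because BM25 is the default, no similarity configuration is needed to use it, but its two parameters can be tuned per index. The request below is a minimal sketch: the index name my_index, the similarity name my_bm25, and the field content are placeholders, and the values shown are simply the defaults (k1 = 1.2, b = 0.75), not recommendations.

```console
PUT /my_index
{
  "settings": {
    "similarity": {
      "my_bm25": {
        "type": "BM25",
        "k1": 1.2,
        "b": 0.75
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "similarity": "my_bm25"
      }
    }
  }
}
```

A higher k1 lets repeated terms keep adding to the score for longer before leveling off. A b closer to 0 reduces the penalty on long fields, and b = 0 turns length normalization off.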

Examples

Basic search using BM25 scoring on the 'content' field.

Elasticsearch

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "quick brown fox"
    }
  }
}

Search with explanation enabled to see detailed BM25 scoring.

Elasticsearch

GET /my_index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "quick brown fox",
        "operator": "or"
      }
    }
  },
  "explain": true
}

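Set up an index that scores the 'content' field with TF-IDF (classic similarity) instead of BM25. The similarity must be both defined in the settings and assigned to the field; defining it alone has no effect. The classic similarity type was deprecated in Elasticsearch 6.x and removed in 7.0, so it is only available on older versions.

Elasticsearch

PUT /my_index
{
  "settings": {
    "similarity": {
      "my_tfidf": {
        "type": "classic"
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "similarity": "my_tfidf"
      }
    }
  }
}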
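On Elasticsearch 7.0 and later, where the classic type is no longer accepted, a TF-IDF-style score can be approximated with a scripted similarity. The sketch below follows the pattern shown in the Elasticsearch similarity documentation. The index name my_index_v7 and the similarity name scripted_tfidf are illustrative, and the Painless formula is one reasonable TF-IDF variant, not necessarily identical to what classic computed.

```console
PUT /my_index_v7
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; double norm = 1 / Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "similarity": "scripted_tfidf"
      }
    }
  }
}
```

The script multiplies a dampened term frequency, an inverse document frequency, and a field-length norm, which are the three ingredients of TF-IDF. A single shard is used here so that the document statistics are not split across shards on a small test index.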

Sample Program

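This example creates a 'books' index that scores the 'title' field with TF-IDF (classic similarity, which requires a cluster older than 7.0) and leaves 'description' on the default BM25. It indexes two documents and searches titles for 'quick fox'. Because the operator is 'and', only the first document matches, and the explanation shows how its score is built from term frequency, inverse document frequency, and the field-length norm.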

Elasticsearch

PUT /books
{
  "settings": {
    "similarity": {
      "my_tfidf": {
        "type": "classic"
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "similarity": "my_tfidf"
      },
      "description": {
        "type": "text"
      }
    }
  }
}

POST /books/_doc/1
{
  "title": "The quick brown fox",
  "description": "A story about a quick fox."
}

POST /books/_doc/2?refresh=true
{
  "title": "Lazy dog sleeps",
  "description": "A story about a lazy dog."
}

GET /books/_search
{
  "query": {
    "match": {
      "title": {
        "query": "quick fox",
        "operator": "and"
      }
    }
  },
  "explain": true
}
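
To inspect the score of one specific document rather than every hit, the explain API can be called with the document ID. This is a small sketch that reuses the books index and document 1 from the requests above. The response contains an explanation tree whose exact contents depend on the similarity in use and on the Elasticsearch version.

```console
GET /books/_explain/1
{
  "query": {
    "match": {
      "title": "quick fox"
    }
  }
}
```

The same request against document 2 reports that the document did not match, which helps when working out why a document is missing from the results.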

Important Notes

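BM25 is the better choice for most search needs: it lets the contribution of repeated terms level off (controlled by k1) and normalizes for document length (controlled by b).

TF-IDF (classic similarity) is older and was removed as a built-in similarity in Elasticsearch 7.0, but it is still useful for understanding basic scoring concepts.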

Use the 'explain' option in your search to see how scores are calculated step-by-step.
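
The balance described in these notes is visible in the textbook form of the two formulas. This is a sketch in standard notation, not a transcription of Lucene's code. Here f(t, D) is the frequency of term t in document D, |D| is the field length, avgdl is the average field length, N is the number of documents, and n_t is the number of documents containing t. The TF-IDF line is a simplified per-term weight; Lucene's classic similarity applies further query-level factors.

```latex
\[
\mathrm{BM25}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
\]
\[
\mathrm{TFIDF}(D, Q) \approx \sum_{t \in Q} \sqrt{f(t, D)} \cdot
  \left(1 + \ln \frac{N + 1}{n_t + 1}\right) \cdot \frac{1}{\sqrt{|D|}}
\]
```

As f(t, D) grows, the BM25 fraction approaches k_1 + 1, so repeated terms stop adding much to the score, while the square root in the TF-IDF weight keeps growing. The parameter b sets how strongly the ratio |D| / avgdl penalizes long fields.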

Summary

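TF-IDF and BM25 score how well each document matches the terms of a query.

Elasticsearch uses BM25 by default; you can switch to TF-IDF on versions that still support the classic similarity (before 7.0).

Relevance scores determine the order of results, so understanding them helps you return the most useful documents first.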