TF-IDF and BM25 scoring in Elasticsearch - Cheat Sheet & Quick Revision

Recall & Review
beginner
What does TF-IDF stand for and what is its purpose in search engines?
TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how important a word is to one document relative to the whole collection: words that appear often in that document but rarely in the others get the highest scores.
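As a rough illustration of the definition above, the sketch below scores a term with raw term frequency multiplied by log(N / df). The toy corpus, the whitespace tokenization, and the helper name `tf_idf` are assumptions made for this example; real engines apply additional smoothing and normalization.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score `term` for one document: raw count times log(N / document frequency)."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    if df == 0:
        return 0.0  # term is absent from the whole collection
    return tf * math.log(len(corpus) / df)

# Hypothetical three-document corpus, already tokenized.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the fox jumps over the dog".split(),
]

doc = corpus[2]
score_the = tf_idf("the", doc, corpus)      # in every document, so IDF is log(1) = 0
score_jumps = tf_idf("jumps", doc, corpus)  # in one document only, so IDF is log(3)
print(score_the, score_jumps)
```

Although "the" appears twice in the third document, it occurs in every document and so contributes nothing, while the rarer "jumps" gets a positive score.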
intermediate
Explain the role of BM25 in Elasticsearch scoring.
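BM25 is the ranking function Elasticsearch uses by default to score how well a document matches a search query. It improves on TF-IDF in two ways: term frequency saturates, so each additional occurrence of a term adds less to the score, and scores are normalized by document length, so long documents are not favored merely for containing more words.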
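A minimal sketch of the textbook BM25 score for a single query term may help here. The function name `bm25_term_score` and the sample numbers are assumptions for illustration. The defaults `k1=1.2` and `b=0.75` match Elasticsearch's documented defaults, but the Lucene implementation underneath may differ in constant factors and in how it stores document lengths, so treat this as an approximation rather than the exact score Elasticsearch reports.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document's score."""
    # Smoothed IDF: rarer terms (small doc_freq) get larger values.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: documents longer than average are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Saturating term-frequency component.
    return idf * tf * (k1 + 1) / (tf + norm)

# Hypothetical index statistics: 1000 documents, the term appears in 10 of them.
average_doc = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=1, doc_len=400, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(average_doc, long_doc)
```

A full query score is the sum of this value over all query terms. With the same term count, the longer document scores lower than the average-length one.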
beginner
How does term frequency (TF) affect document scoring?
Term frequency counts how often a word appears in a document. The more often a query term appears, the higher the document scores for that term, although in BM25 the gain from each additional occurrence gets smaller.
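To make the effect of term frequency concrete, the sketch below counts a term in a sample sentence and then shows how a BM25-style term-frequency component flattens out as the count grows. The helper names, the sample text, and `k1=1.2` are assumptions for this example, and document length is deliberately ignored.

```python
from collections import Counter

def raw_tf(term, text):
    """Count occurrences of `term` in lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())[term]

def saturated_tf(tf, k1=1.2):
    """BM25-style term-frequency component with document length left out."""
    return tf * (k1 + 1) / (tf + k1)

text = "search engines rank search results for each search"
count = raw_tf("search", text)  # "search" occurs three times in this sentence

# The raw count grows without bound; the saturated value approaches k1 + 1.
for tf in (1, 2, 5, 10, 100):
    print(tf, round(saturated_tf(tf), 3))
```

Going from 1 to 2 occurrences changes the value far more than going from 10 to 100, which is why repeating a keyword many times yields little extra score.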
beginner
What is inverse document frequency (IDF) and why is it important?
Inverse document frequency measures how rare a word is across all documents. Rare words get higher IDF scores, so they have more impact on ranking, helping to highlight unique and meaningful terms.
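The following sketch computes a plain log(N / df) IDF for every term in a tiny collection, to show how rarity translates into weight. The corpus and the helper name `idf_table` are assumptions for illustration; production engines use smoothed variants of this formula.

```python
import math

def idf_table(corpus):
    """Map each term to log(N / df), where df is the number of documents containing it."""
    n_docs = len(corpus)
    vocab = {term for doc in corpus for term in doc}
    return {
        term: math.log(n_docs / sum(1 for doc in corpus if term in doc))
        for term in vocab
    }

# Hypothetical corpus: each document is represented as a set of its terms.
corpus = [
    {"the", "quick", "fox"},
    {"the", "lazy", "dog"},
    {"the", "fox", "dog"},
]

table = idf_table(corpus)
# List terms from rarest (highest IDF) to most common (lowest IDF).
for term, value in sorted(table.items(), key=lambda item: (-item[1], item[0])):
    print(term, round(value, 3))
```

Terms found in a single document ("quick", "lazy") rank highest, while "the", which is in every document, gets an IDF of zero.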
intermediate
Why does BM25 use document length normalization?
BM25 adjusts scores based on document length to avoid favoring longer documents just because they have more words. This keeps scoring fair by balancing term frequency with document size.
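The role of length normalization can be sketched by holding the term count fixed and varying only the document length. The function name `length_norm_tf` and the sample lengths are assumptions for this example. The parameter `b` controls how strongly length matters (0.75 is the Elasticsearch default), and setting `b=0` turns normalization off.

```python
def length_norm_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term-frequency component including document length normalization."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

# Same term count (3) in documents of different lengths; the average length is 100.
short_doc = length_norm_tf(3, doc_len=50, avg_doc_len=100)
average_doc = length_norm_tf(3, doc_len=100, avg_doc_len=100)
long_doc = length_norm_tf(3, doc_len=400, avg_doc_len=100)

# With b = 0 the document length no longer affects the result.
long_doc_no_norm = length_norm_tf(3, doc_len=400, avg_doc_len=100, b=0.0)

print(short_doc, average_doc, long_doc, long_doc_no_norm)
```

Three occurrences in a 50-word document count for more than three occurrences in a 400-word document, because the term makes up a larger share of the shorter text.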
What does the 'IDF' part of TF-IDF measure?
A. How often a term appears in one document
B. The length of a document
C. The total number of documents
D. How rare a term is across all documents
Which scoring method does Elasticsearch use by default?
A. TF-IDF
B. BM25
C. PageRank
D. Cosine Similarity
Why does BM25 include document length normalization?
A. To balance scores so longer documents don't get an unfair advantage
B. To ignore term frequency
C. To favor longer documents
D. To count the number of unique terms
In TF-IDF, what happens if a term appears in many documents?
A. Its IDF score decreases
B. It is ignored
C. Its TF score increases
D. Its IDF score increases
Which factor does BM25 consider that basic TF-IDF does not?
A. Term frequency
B. Inverse document frequency
C. Document length normalization
D. Number of documents
Describe how TF-IDF helps rank documents in a search engine.
Think about how often a word appears in one document versus many documents.
Explain why BM25 is considered an improvement over TF-IDF in Elasticsearch.
Consider how BM25 handles long documents and repeated terms.