
Embedding models for semantic search in Agentic AI - Model Metrics & Evaluation

Which metric matters for embedding models in semantic search and WHY

For embedding models used in semantic search, the key metric is Recall@K. This measures how often the correct or relevant items appear in the top K search results. It matters because users want the right answers to show up quickly, not buried deep in the list.

Another important metric is Mean Reciprocal Rank (MRR), the average across queries of the reciprocal of the rank at which the first relevant result appears. A higher MRR means users find what they want faster.

Precision is less important here because semantic search focuses on finding all relevant items, not just avoiding irrelevant ones.
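As a rough sketch, both metrics can be computed directly from ranked result lists. The function names and input shapes below are illustrative, not taken from any particular library:

```python
def recall_at_k(relevant_ids, ranked_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    top_k = set(ranked_ids[:k])
    hits = len(top_k & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_reciprocal_rank(queries):
    """queries: list of (relevant_ids, ranked_ids) pairs.

    Averages 1/rank of the first relevant result per query;
    a query contributes 0 if no relevant document is retrieved.
    """
    total = 0.0
    for relevant_ids, ranked_ids in queries:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(queries)
```

Note that Recall@K divides by the total number of relevant documents in the dataset, not by K; dividing by K would give Precision@K instead.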

Confusion matrix or equivalent visualization

Semantic search evaluation often uses a ranking table instead of a confusion matrix. Here is a simple example for one query:

Query: "apple fruit benefits"

Rank | Document ID | Relevant?
-----|-------------|----------
1    | Doc_5       | Yes (TP)
2    | Doc_12      | No (FP)
3    | Doc_3       | Yes (TP)
4    | Doc_7       | No (FP)
5    | Doc_9       | No (FP)

Total relevant docs in dataset: 3
Relevant docs retrieved in top 5: 2
Recall@5 = 2/3 = 0.67

This shows how many relevant documents appear in the top results, which is exactly what Recall@K measures.
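The arithmetic from the table can be reproduced in a few lines (the relevance flags follow the example ranking for Doc_5, Doc_12, Doc_3, Doc_7, Doc_9):

```python
# Relevance flags for the top-5 results of "apple fruit benefits"
retrieved_relevant = [True, False, True, False, False]
total_relevant = 3  # relevant documents in the whole dataset

recall_at_5 = sum(retrieved_relevant) / total_relevant
print(round(recall_at_5, 2))  # 0.67
```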

Precision vs Recall tradeoff with concrete examples

In semantic search, recall is usually more important than precision. For example:

  • High recall, lower precision: The search returns many results including most relevant ones, but also some irrelevant. This is good if users want to see all possible answers.
  • High precision, lower recall: The search returns only very confident results but misses some relevant ones. This might frustrate users who want a complete answer.

For example, a medical literature search should have high recall to avoid missing important studies, even if some irrelevant papers appear.
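To see the tradeoff numerically, here is a small sketch over a hypothetical ranked list of 10 results, with 3 relevant documents in the dataset. As K grows, recall can only rise (or hold steady) while precision tends to fall:

```python
# Relevance flags for a hypothetical ranked list of 10 results;
# the dataset contains 3 relevant documents in total.
ranking = [True, False, True, False, False, True, False, False, False, False]
total_relevant = 3

for k in (3, 5, 10):
    hits = sum(ranking[:k])                 # relevant results within the top k
    precision_at_k = hits / k               # tends to shrink as k grows
    recall_at_k = hits / total_relevant     # grows (or holds) as k grows
    print(f"K={k}: precision={precision_at_k:.2f}, recall={recall_at_k:.2f}")
```

Here returning more results (K=10) achieves full recall at the cost of precision, which is exactly the tradeoff a recall-first application like medical literature search accepts.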

What "good" vs "bad" metric values look like for semantic search

Good values (rough rules of thumb; exact thresholds depend on the corpus and task):

  • Recall@10 above 0.8 means most relevant items appear in the top 10 results.
  • MRR above 0.7 means relevant results appear near the top.

Bad values:

  • Recall@10 below 0.4 means many relevant items are missed in top results.
  • MRR below 0.3 means relevant results appear too far down the list.

Low recall frustrates users because they miss important information. Low MRR means users spend more time scrolling.

Common pitfalls in metrics for embedding semantic search

  • Ignoring recall: Focusing only on precision can hide that many relevant results are missed.
  • Data leakage: If test queries or documents also appear in the training data, metrics look artificially high.
  • Overfitting: The model performs well on test data but poorly on new queries, showing unstable recall or MRR.
  • Using accuracy: Accuracy is not meaningful for ranking tasks like semantic search.

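A minimal sketch of a leakage check, assuming documents carry unique IDs (the IDs below are hypothetical): verify that the test set shares no documents with the training corpus before trusting the metrics.

```python
# Hypothetical document IDs for a train/test split.
train_doc_ids = {"Doc_1", "Doc_2", "Doc_5"}
test_doc_ids = {"Doc_5", "Doc_9"}

# Any overlap means the model may have memorized test documents,
# inflating Recall@K and MRR.
leaked = train_doc_ids & test_doc_ids
if leaked:
    print(f"Warning: {len(leaked)} test document(s) also in training: {sorted(leaked)}")
```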
Self-check question

Your embedding model for semantic search has 98% accuracy but only 12% recall@10 on relevant documents. Is it good for production? Why or why not?

Answer: No, it is not good. High accuracy here is misleading because most documents are irrelevant, so the model can guess irrelevant and be right often. But 12% recall@10 means it finds very few relevant results in the top 10, so users will miss important information.
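The self-check numbers can be reproduced with a toy imbalance: in a hypothetical corpus where 98% of documents are irrelevant, a model that retrieves nothing still scores 98% accuracy while its recall is zero.

```python
# Hypothetical corpus: 1000 documents, only 20 relevant.
total_docs = 1000
relevant_docs = 20

# A model that labels every document "irrelevant" is correct on
# all 980 irrelevant documents but finds none of the relevant ones.
correct_predictions = total_docs - relevant_docs
accuracy = correct_predictions / total_docs   # 0.98
recall = 0 / relevant_docs                    # 0.0
```

This is why ranking tasks are evaluated with Recall@K and MRR rather than accuracy.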

Key Result
Recall@K and Mean Reciprocal Rank (MRR) are key metrics to ensure relevant results appear high in semantic search.