
Similarity search and retrieval in Prompt Engineering / GenAI - Model Metrics & Evaluation

Which metric matters for Similarity search and retrieval and WHY

In similarity search, the goal is to find the items most like a query. The key metrics are Precision, Recall, and the F1 score. Precision tells us how many of the retrieved items are truly similar; Recall tells us how many of the truly similar items we found. F1, the harmonic mean of the two, balances both. We want high recall so we don't miss good matches, and high precision so we don't return wrong ones. Mean Average Precision (MAP) is often used as well, to measure the quality of the ranking rather than just the set of retrieved results. Together, these metrics tell us whether the search is accurate and useful.

Confusion matrix for Similarity search
                     | Retrieved Similar | Retrieved Not Similar
---------------------|-------------------|----------------------
Actually Similar     | TP                | FN
Actually Not Similar | FP                | TN

Where:
- TP (True Positive): Correctly retrieved similar items
- FP (False Positive): Retrieved items that are not similar
- FN (False Negative): Similar items missed by retrieval
- TN (True Negative): Items correctly not retrieved

Total items = TP + FP + FN + TN

Example:
Suppose we have 100 items, and 30 are truly similar to the query.
The model retrieves 40 items: 25 are truly similar (TP=25), 15 are not (FP=15).
Missed similar items = 5 (FN=5); the rest are correctly not retrieved (TN=55).
Precision = 25 / (25 + 15) = 0.625
Recall = 25 / (25 + 5) = 0.833
F1 = 2 × 0.625 × 0.833 / (0.625 + 0.833) ≈ 0.714
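The arithmetic above can be checked with a few lines of code; a minimal sketch using only the counts from the worked example:

```python
# Confusion-matrix counts from the worked example above
tp, fp, fn, tn = 25, 15, 5, 55

precision = tp / (tp + fp)   # 25 / 40 = 0.625
recall = tp / (tp + fn)      # 25 / 30 ≈ 0.833
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Precision: {precision:.3f}")  # 0.625
print(f"Recall:    {recall:.3f}")     # 0.833
print(f"F1 score:  {f1:.3f}")         # 0.714
```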

Precision vs Recall tradeoff with examples

Imagine a photo app that finds similar pictures. If it shows many photos, it may find most similar ones (high recall) but also show wrong ones (low precision). If it shows fewer photos, it may be very sure about them (high precision) but miss some good matches (low recall).

In a music recommendation system, high recall means suggesting many songs you might like, but some may be off. High precision means only suggesting songs you really like, but fewer suggestions.

Choosing between precision and recall depends on what matters more: missing good matches (recall) or showing wrong matches (precision).
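The tradeoff can be made concrete by sweeping a similarity-score threshold: retrieve everything scoring at or above the threshold, then measure both metrics. A sketch with illustrative scores and ground-truth labels (the numbers are made up for demonstration, not taken from the examples above):

```python
# Illustrative similarity scores and ground-truth labels (1 = truly similar)
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def precision_recall(threshold):
    """Retrieve items scoring >= threshold; return (precision, recall)."""
    retrieved = [l for s, l in zip(scores, labels) if s >= threshold]
    tp = sum(retrieved)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / sum(labels)
    return precision, recall

# A strict threshold is precise but misses matches; a loose one is the reverse.
for t in (0.8, 0.5, 0.1):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold monotonically raises recall while (usually) lowering precision, which is exactly the photo-app behavior described above.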

What good vs bad metric values look like for Similarity search
  • Good: Precision and recall both above 0.8 means most retrieved items are correct and most similar items are found.
  • Acceptable: Precision around 0.7 and recall around 0.7 means moderate quality, some errors and misses.
  • Bad: Precision below 0.5 or recall below 0.5 means many wrong items retrieved or many similar items missed.
  • Mean Average Precision (MAP) close to 1.0 is excellent; a random ranking scores a MAP roughly equal to the fraction of truly similar items in the collection, so values near that baseline mean the ranking adds no value.
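Average precision for one ranked result list is the mean of precision@k over the ranks where a truly similar item appears; MAP averages this across queries. A minimal sketch with an illustrative ranking:

```python
def average_precision(relevance):
    """relevance: ranked list of 0/1 flags (1 = truly similar item)."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)  # precision@k at each hit
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant items at ranks 1, 3, 4 -> precisions 1/1, 2/3, 3/4
ap = average_precision([1, 0, 1, 1, 0])
print(round(ap, 3))  # 0.806
```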

Common pitfalls in Similarity search metrics
  • Accuracy paradox: If most items are not similar, accuracy can be high by always saying "not similar" but this is useless.
  • Ignoring recall: High precision but low recall means many good matches are missed.
  • Ignoring precision: High recall but low precision means many wrong matches confuse users.
  • Data leakage: Using test items in training can inflate metrics falsely.
  • Overfitting: Model performs well on known data but poorly on new queries.

Self-check question

Your similarity search model has 98% accuracy but only 12% recall on similar items. Is it good for production? Why or why not?

Answer: No, it is not good. The high accuracy is misleading because most items are not similar, so the model just says "not similar" often. The very low recall means it misses almost all truly similar items, which defeats the purpose of similarity search.
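The accuracy paradox in this answer can be reproduced with a toy imbalanced dataset; a sketch assuming 1000 items of which only 25 are truly similar, and a model that retrieves just 3 items (all correct) — close to the self-check numbers:

```python
# Toy imbalanced dataset: 1000 items, only 25 truly similar.
# The model retrieves just 3 items, all of them correct.
tp, fp = 3, 0
fn = 25 - tp                  # 22 similar items missed
tn = 1000 - tp - fp - fn      # everything else correctly ignored

accuracy = (tp + tn) / 1000
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.1%}")  # ~98% -- looks great
print(f"recall   = {recall:.1%}")    # 12% -- misses most similar items
```

High accuracy here comes almost entirely from the 975 true negatives, not from finding similar items.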

Key Result
Precision and recall are the key metrics for similarity search: together they balance finding most of the truly similar items against avoiding wrong matches.