Hybrid search combines semantic understanding with keyword matching to find the best results. The key metrics are recall and precision. Recall is the share of all relevant results that the search actually finds; it matters when missing a good answer is costly. Precision is the share of returned results that are actually relevant; it matters when irrelevant results (noise) are costly. Because hybrid search balances meaning against exact wording, the two metrics together show whether it finds enough good matches without returning too many wrong ones.
Hybrid search (semantic + keyword) in Prompt Engineering / GenAI - Model Metrics & Evaluation
|                     | Predicted relevant | Predicted not relevant |
|---------------------|--------------------|------------------------|
| Actual relevant     | TP                 | FN                     |
| Actual not relevant | FP                 | TN                     |
TP = Correctly found relevant results
FP = Found results that are not relevant
FN = Relevant results missed by search
TN = Correctly ignored irrelevant results
Metrics use these counts:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
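The two formulas above can be computed for a single query from the set of returned document IDs and the set of IDs judged relevant. This is a minimal sketch; the function name `precision_recall` and the sample IDs are illustrative assumptions, not part of any particular library.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the search returned (hypothetical sample data below).
    relevant:  IDs a human judged relevant for the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant results that were found
    fp = len(retrieved - relevant)   # returned results that are not relevant
    fn = len(relevant - retrieved)   # relevant results the search missed
    # Guard against division by zero when nothing is returned or nothing is relevant.
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d1", "d3", "d5"]
p, r = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

TN does not appear in either formula, which is why both metrics stay informative even when the collection is mostly irrelevant documents.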
In hybrid search, tuning for more semantic matching can increase recall by finding more relevant results even if keywords differ. But this may lower precision by including less exact matches. Tuning for strict keyword matching can increase precision by returning exact hits but lower recall by missing related results.
Example 1: A legal document search needs high precision to avoid irrelevant cases. So keyword matching is emphasized.
Example 2: A customer support search wants high recall to find all helpful answers, so semantic search is emphasized.
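One common way to express this tuning is a weighted sum of the two scores. The sketch below assumes both scores are already normalized to the range 0 to 1; the names `hybrid_rank`, `alpha`, and `threshold`, and all score values, are hypothetical. Other fusion methods exist, and real systems may combine scores differently.

```python
def hybrid_rank(semantic_scores, keyword_scores, alpha=0.5, threshold=0.5):
    """Blend per-document scores and return IDs at or above threshold, best first.

    alpha = 1.0 means pure semantic scoring, alpha = 0.0 means pure keyword scoring.
    Documents missing from one score map are treated as scoring 0.0 there.
    """
    doc_ids = set(semantic_scores) | set(keyword_scores)
    blended = {
        d: alpha * semantic_scores.get(d, 0.0)
        + (1 - alpha) * keyword_scores.get(d, 0.0)
        for d in doc_ids
    }
    kept = [d for d in blended if blended[d] >= threshold]
    return sorted(kept, key=lambda d: blended[d], reverse=True)


# Hypothetical scores: d2 matches on meaning only, d3 on exact words only.
semantic = {"d1": 0.9, "d2": 0.8, "d3": 0.2}
keyword = {"d1": 0.9, "d2": 0.1, "d3": 0.7}

semantic_heavy = hybrid_rank(semantic, keyword, alpha=0.8)
keyword_heavy = hybrid_rank(semantic, keyword, alpha=0.2)
print(semantic_heavy)  # ['d1', 'd2']
print(keyword_heavy)   # ['d1', 'd3']
```

With a high `alpha`, the paraphrased match `d2` is kept and the exact-word match `d3` is dropped; with a low `alpha`, the reverse happens. Whether that raises or lowers precision and recall depends on which of those documents is actually relevant for the query.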
Good: Precision and recall both above 0.8 mean the search finds most relevant results while returning few irrelevant ones.
Bad: Precision below 0.5 means more than half of the returned results are irrelevant, which confuses users. Recall below 0.5 means more than half of the relevant results are missed.
Balanced metrics around 0.7 are often acceptable, depending on the use case.
- Accuracy paradox: Accuracy can look high while the search is poor. When most documents are irrelevant, a model that returns very few results still gets a large TN count, so accuracy stays high even if recall is low.
- Data leakage: Using test queries that appear in training can inflate metrics.
- Overfitting: Tuning the weights too closely to one set of queries, for example leaning heavily on keyword matching, may not generalize and can miss semantic matches, hurting recall.
- Ignoring user intent: Metrics alone don't capture if results satisfy user needs.
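The data leakage pitfall in the list above can be checked before evaluation by looking for test queries that also appear in the tuning or training set. This is a minimal sketch under simple assumptions: queries are plain strings, and two queries count as the same if they match after lowercasing and collapsing whitespace. The helper names and sample queries are hypothetical, and near-duplicates with different wording would need a fuzzier comparison.

```python
def normalize(query):
    """Lowercase and collapse runs of whitespace so trivial variants compare equal."""
    return " ".join(query.lower().split())


def leaked_queries(train_queries, test_queries):
    """Return the test queries whose normalized form also appears in the training set."""
    seen = {normalize(q) for q in train_queries}
    return [q for q in test_queries if normalize(q) in seen]


train = ["How do I reset my password?", "refund policy"]
test = ["how do i reset my  password?", "cancel subscription"]

leaked = leaked_queries(train, test)
print(len(leaked), "of", len(test), "test queries also appear in training")
```

Any leaked queries would typically be removed from the test set before precision and recall are reported.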
No, a hybrid search with high accuracy but only 12% recall is not good. The high accuracy most likely comes from true negatives: the model returns very few results, so the many irrelevant items are correctly ignored and dominate the score. But 12% recall means the search misses 88% of the relevant results, so users won't find what they need. Improving recall is critical for the search to be useful.
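The answer above can be checked with a small worked example. Only the 12% recall comes from the text; the collection size and the other counts below are hypothetical, chosen to show how accuracy can stay high while recall is low.

```python
# Hypothetical collection: 10,000 documents, 100 of them relevant to the query.
tp = 12     # relevant documents returned
fn = 88     # relevant documents missed
fp = 8      # irrelevant documents returned
tn = 9892   # irrelevant documents correctly not returned

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"accuracy={accuracy:.4f} recall={recall:.2f} precision={precision:.2f}")
# accuracy=0.9904 recall=0.12 precision=0.60
```

Accuracy is above 99% almost entirely because of the 9,892 true negatives, while 88 of the 100 relevant documents are never shown to the user.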