# Multi-query retrieval in Prompt Engineering / GenAI - Model Metrics & Evaluation

In multi-query retrieval, the goal is to find relevant items for several queries at once. The key metrics are Recall and Mean Average Precision (MAP). Recall tells us how many relevant items we found out of all possible relevant items, which is important because missing relevant results hurts user experience. MAP measures how well the system ranks relevant items higher, which matters because users usually look at top results first.
For each query, results can be:

- Relevant (R) and Retrieved (Ret) = True Positive (TP)
- Not Relevant and Retrieved = False Positive (FP)
- Relevant and Not Retrieved = False Negative (FN)
- Not Relevant and Not Retrieved = True Negative (TN)
Example for one query: TP = 8, FP = 2, FN = 4, TN = 86 (total items = 100)

- Precision = TP / (TP + FP) = 8 / (8 + 2) = 0.8
- Recall = TP / (TP + FN) = 8 / (8 + 4) ≈ 0.67
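The worked example above can be sketched in a few lines of Python; the counts are the ones from the example, and the helper functions are illustrative, not from any specific library.

```python
def precision(tp, fp):
    """Fraction of retrieved items that are actually relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant items that were actually retrieved."""
    return tp / (tp + fn)

# Counts from the worked example: 100 items, 12 relevant, 10 retrieved.
tp, fp, fn, tn = 8, 2, 4, 86
print(round(precision(tp, fp), 2))  # 0.8
print(round(recall(tp, fn), 2))     # 0.67
```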
In multi-query retrieval, sometimes retrieving more results increases recall but lowers precision because more irrelevant items appear. For example, a search engine showing many results catches more relevant pages (high recall) but also shows unrelated pages (low precision). If the system shows fewer results, precision improves but recall drops, missing some relevant items.
Choosing the right balance depends on the use case. For a legal document search, high recall is critical to not miss any important documents. For a product search on a shopping site, high precision is better to show only relevant products quickly.
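The trade-off described above can be made concrete by computing precision and recall at different cutoffs k of a ranked result list. The ranked list and relevant set here are made-up illustration data.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall when only the top-k ranked results are shown."""
    retrieved = ranked_ids[:k]
    hits = sum(1 for doc in retrieved if doc in relevant_ids)
    return hits / k, hits / len(relevant_ids)

relevant = {1, 2, 3, 4}                 # hypothetical ground-truth relevant docs
ranked = [1, 2, 9, 3, 8, 7, 4, 6]       # hypothetical system ranking

for k in (2, 4, 8):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

With a small cutoff (k=2) precision is perfect but recall is only 0.5; showing everything (k=8) pushes recall to 1.0 while precision falls to 0.5, mirroring the legal-search vs. shopping-search discussion above.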
- Good: Recall above 0.8 means most relevant items are found. MAP above 0.7 means relevant items rank near the top.
- Bad: Recall below 0.5 means many relevant items are missed. MAP below 0.4 means relevant items are buried deep in results.
Good systems balance recall and precision to provide useful, relevant results quickly for all queries.
- Accuracy paradox: Accuracy can be misleading if relevant items are rare. A system that returns mostly irrelevant items can still have high accuracy.
- Data leakage: If queries or relevant items leak into training, metrics look better but don't reflect real performance.
- Overfitting: High metrics on training queries but poor results on new queries show overfitting.
- Ignoring ranking: Treating retrieval as just relevant or not misses how well relevant items are ranked, which MAP captures.
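To see how MAP captures ranking quality, here is a minimal sketch: average precision (AP) per query is the mean of precision@k taken at each rank where a relevant item appears, divided by the total number of relevant items for that query, and MAP is the mean of AP over queries. The per-query relevance lists below are toy data.

```python
def average_precision(relevances, total_relevant):
    """AP: sum of precision@k at each relevant rank, over total relevant items."""
    hits, score = 0, 0.0
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            score += hits / k
    return score / total_relevant if total_relevant else 0.0

def mean_average_precision(per_query):
    """MAP: mean of per-query average precision."""
    return sum(average_precision(rels, total) for rels, total in per_query) / len(per_query)

# Toy data: True means the result at that rank was relevant.
queries = [
    ([True, False, True], 2),   # relevant items found at ranks 1 and 3
    ([False, True, True], 2),   # relevant items found at ranks 2 and 3
]
print(round(mean_average_precision(queries), 3))  # 0.708
```

Note that the same set of retrieved items gives a higher AP when the relevant ones sit near the top, which is exactly the ranking sensitivity that plain recall misses.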
Your multi-query retrieval model has 98% accuracy but only 12% recall on relevant items. Is it good for production? Why or why not?
Answer: No, it is not good. Accuracy is high only because most items are irrelevant and the model mostly predicts "irrelevant". But 12% recall means it finds very few of the relevant items, which defeats the purpose of retrieval: users will miss most relevant results and be unhappy.