Handling retrieval failures gracefully in Agentic AI - Model Metrics & Evaluation

When handling retrieval failures gracefully, the key metric is Recall. Recall tells us how many of the relevant items the system successfully retrieved. If retrieval fails often, recall drops, meaning the system misses important information. High recall ensures the system finds most of what it needs, even if some attempts fail. Additionally, the Failure Rate (the percentage of retrieval attempts that fail) is important to track to understand how often the system cannot get the data it needs.
Retrieval Outcome Confusion Matrix (Simplified):

|                  | Retrieved Relevant | Retrieved Irrelevant |
|------------------|--------------------|----------------------|
| Relevant Items   | TP                 | FN                   |
| Irrelevant Items | FP                 | TN                   |
Where:
- TP (True Positive): Relevant data retrieved successfully
- FN (False Negative): Relevant data not retrieved (failure)
- FP (False Positive): Irrelevant data retrieved
- TN (True Negative): Irrelevant data not retrieved
Total retrieval attempts = TP + FP + FN + TN
Failure Rate = FN / (TP + FN) (how often retrieval of a relevant item fails)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)

In retrieval, Recall is about finding all the relevant data, while Precision is about how many of the retrieved items are actually relevant.
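The formulas above can be sketched as a small Python helper. This is a minimal illustration; the function name and the example counts are made up, not taken from any particular system:

```python
def retrieval_metrics(tp, fp, fn, tn):
    """Compute recall, precision, and failure rate from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of relevant items that were retrieved
    precision = tp / (tp + fp)     # share of retrieved items that are relevant
    failure_rate = fn / (tp + fn)  # share of relevant items missed (= 1 - recall)
    return recall, precision, failure_rate

# Example: 80 relevant items retrieved, 20 missed, 30 irrelevant retrieved
recall, precision, failure_rate = retrieval_metrics(tp=80, fp=30, fn=20, tn=870)
# recall = 0.8, precision = 80/110 ≈ 0.727, failure_rate = 0.2
```

Note that failure rate as defined here is simply 1 − recall, which is why the two metrics move together throughout this discussion.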
Example 1: Search engine
If the system retrieves many results but misses some relevant ones, recall is low. If it retrieves only a few but very accurate results, precision is high but recall may be low. For retrieval failures, recall matters more because missing important data is worse than extra irrelevant data.
Example 2: Medical diagnosis retrieval
Missing a relevant medical record (low recall) can be dangerous. So, the system should tolerate some irrelevant data (lower precision) to keep recall high and avoid retrieval failures.
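The recall/precision trade-off in both examples can be made concrete by varying the score threshold a retriever uses to decide what counts as "retrieved". The scores, labels, and function below are hypothetical, for illustration only:

```python
# Hypothetical relevance scores from a retriever and ground-truth labels (1 = relevant)
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def recall_precision_at(threshold, scores, labels):
    """Treat every item scoring >= threshold as 'retrieved' and measure both metrics."""
    retrieved = [label for score, label in zip(scores, labels) if score >= threshold]
    tp = sum(retrieved)        # relevant items actually retrieved
    fn = sum(labels) - tp      # relevant items missed
    recall = tp / (tp + fn)
    precision = tp / len(retrieved) if retrieved else 0.0
    return recall, precision

# Strict threshold: perfect precision, but half the relevant items are missed
print(recall_precision_at(0.85, scores, labels))  # (0.5, 1.0)
# Loose threshold: every relevant item found, at the cost of some irrelevant ones
print(recall_precision_at(0.35, scores, labels))  # recall 1.0, precision 4/6 ≈ 0.67
```

Lowering the threshold is the simple lever the medical example argues for: accept more irrelevant results to avoid missing a relevant record.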
- Good: Recall above 90%, Failure Rate below 10%. The system finds most relevant data and rarely fails to retrieve.
- Bad: Recall below 70%, Failure Rate above 30%. Many relevant items are missed, causing poor user experience or wrong decisions.
- Precision can be moderate (70-80%) if recall is high, since some irrelevant data is acceptable to avoid failures.
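These rules of thumb can be encoded in a small helper for monitoring. The thresholds are exactly the ones listed above; the function name and the "borderline" band for the in-between zone are assumptions of this sketch:

```python
def retrieval_health(recall, failure_rate):
    """Classify retrieval quality against the rules of thumb listed above."""
    if recall > 0.90 and failure_rate < 0.10:
        return "good"
    if recall < 0.70 or failure_rate > 0.30:
        return "bad"
    return "borderline"  # in between: worth monitoring

print(retrieval_health(0.93, 0.07))  # good
print(retrieval_health(0.65, 0.35))  # bad
```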
- Ignoring recall: Focusing only on precision can hide retrieval failures, as the system may retrieve few but very accurate items, missing many relevant ones.
- Accuracy paradox: High overall accuracy can be misleading if the dataset is imbalanced (many irrelevant items). The system might appear good but fail to retrieve relevant data.
- Data leakage: If retrieval uses future or test data accidentally, metrics look better but don't reflect real failures.
- Overfitting: The system may perform well on training data retrieval but fail in real scenarios, causing high failure rates.
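The accuracy paradox above is easy to reproduce with an imbalanced dataset: a system that retrieves nothing at all can still score near-perfect accuracy. The counts below are invented for illustration:

```python
# 1,000 items in the corpus, only 10 of them relevant (heavy class imbalance).
# A broken system that retrieves nothing still looks great on accuracy:
tp, fp, fn, tn = 0, 0, 10, 990

accuracy = (tp + tn) / (tp + fp + fn + tn)  # 0.99 -- looks excellent
recall = tp / (tp + fn)                     # 0.0  -- every relevant item missed
print(accuracy, recall)  # 0.99 0.0
```

This is why recall, not accuracy, is the metric to watch when relevant items are rare.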
Your retrieval system has 98% accuracy but only 12% recall on relevant data. Is it good for production? Why not?
Answer: No, it is not good. Despite high accuracy, the very low recall means the system misses most relevant data. This leads to many retrieval failures, which harms user trust and system usefulness. Improving recall is critical even if accuracy drops slightly.
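One set of counts consistent with these figures (illustrative, not from the source) makes the gap between 98% accuracy and 12% recall concrete:

```python
# 5,000 items, 100 relevant; the system retrieves 12 relevant and 12 irrelevant items.
tp, fp, fn, tn = 12, 12, 88, 4888

accuracy = (tp + tn) / (tp + fp + fn + tn)  # 4900 / 5000 = 0.98
recall = tp / (tp + fn)                     # 12 / 100 = 0.12
failure_rate = fn / (tp + fn)               # 88 / 100 = 0.88
print(accuracy, recall, failure_rate)
```

The huge pool of correctly ignored irrelevant items (TN = 4,888) inflates accuracy, while 88% of the relevant items are never retrieved.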