Hallucination detection in Prompt Engineering / GenAI - Model Metrics & Evaluation

Hallucination detection means identifying when a model states something untrue or fabricated. The key metrics are precision and recall. Precision tells us what fraction of the outputs flagged as hallucinations actually were hallucinations. Recall tells us what fraction of all real hallucinations the detector caught. We want both high, but recall is often more important: missing a hallucination means users end up trusting wrong information. The F1 score, the harmonic mean of precision and recall, balances the two in a single number.
|                          | Predicted Hallucination | Predicted Not Hallucination |
|--------------------------|-------------------------|-----------------------------|
| Actual Hallucination     | True Positive (TP)      | False Negative (FN)         |
| Actual Not Hallucination | False Positive (FP)     | True Negative (TN)          |
TP: Model correctly flagged a hallucination
FP: Model flagged truthful info as a hallucination
FN: Model missed a hallucination
TN: Model correctly passed truthful info
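The metrics follow directly from these counts. A minimal sketch, using hypothetical counts for illustration:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute detection metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical counts: 40 hallucinations flagged correctly,
# 10 false alarms, 10 hallucinations missed.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=10)
print(p, r)  # 0.8 0.8, so F1 is also 0.8
```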
If we focus on high precision, the model rarely calls something a hallucination unless very sure. This means fewer false alarms but might miss some hallucinations (lower recall). This is good if false alarms confuse users.
If we focus on high recall, the model catches almost all hallucinations but may wrongly flag some true info (lower precision). This is better when missing any hallucination is risky, like in medical advice.
Choosing depends on what is worse: missing hallucinations or wrongly warning users.
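The trade-off above can be sketched by scoring the same detector at two decision thresholds. The scores and labels below are hypothetical hallucination-likelihood scores (label 1 means the output really was a hallucination):

```python
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def metrics_at(threshold: float) -> tuple[float, float]:
    """Precision and recall when flagging scores >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(metrics_at(0.85))  # (1.0, 0.5): strict, no false alarms, misses half
print(metrics_at(0.35))  # about (0.67, 1.0): lenient, catches all, more false alarms
```

Raising the threshold buys precision at the cost of recall, and vice versa; the right setting depends on which error is costlier.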
Good: Precision and recall both above 0.8 mean the model finds most hallucinations and rarely flags true info. An F1 score near 0.85 or higher indicates balanced performance.
Bad: Precision below 0.5 means many false alarms, which annoys users. Recall below 0.5 means many hallucinations slip through, eroding trust. An F1 score below 0.6 indicates poor detection.
- Accuracy paradox: If hallucinations are rare, a model that always says "no hallucination" can have high accuracy but is useless.
- Data leakage: If test data is too similar to training, metrics look better than real life.
- Overfitting: Model may detect hallucinations only in training style, failing on new types.
- Ignoring class imbalance: Hallucinations are often rare, so metrics like accuracy mislead.
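The accuracy paradox on imbalanced data can be shown in a few lines. In this toy setup, only 2 of 100 outputs are hallucinations and the "detector" never flags anything:

```python
labels = [1] * 2 + [0] * 98   # 1 = actual hallucination
preds = [0] * 100             # always predict "not a hallucination"

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
recall = tp / sum(labels)
print(accuracy, recall)  # 0.98 0.0 (high accuracy, useless detector)
```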
Your hallucination detection model has 98% accuracy but only 12% recall on hallucinations. Is it good for production? Why or why not?
Answer: No, it is not good. The model misses 88% of hallucinations (low recall), so it fails to warn users about most wrong info. High accuracy is misleading because hallucinations are rare. Improving recall is critical.
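One set of hypothetical confusion-matrix counts consistent with the stated metrics (the 10,000-output scale and the specific counts are assumptions for illustration):

```python
# 10,000 outputs, 200 of them real hallucinations.
tp, fn = 24, 176              # recall = 24 / 200 = 0.12
fp = 24                       # precision would be 24 / 48 = 0.5
tn = 10_000 - tp - fn - fp    # 9776 truthful outputs correctly passed

accuracy = (tp + tn) / 10_000
recall = tp / (tp + fn)
print(accuracy, recall)  # 0.98 0.12
```

The arithmetic confirms the answer: 176 of 200 hallucinations reach users unflagged even though accuracy looks excellent.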