
Hallucination detection in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Hallucination detection
Which metric matters for Hallucination detection and WHY

Hallucination detection means identifying when a model states something untrue or fabricated. The key metrics are Precision and Recall. Precision tells us how many of the flagged hallucinations were actually hallucinations. Recall tells us how many of the real hallucinations the detector caught out of all that existed. We want both high, but recall is often more important because a missed hallucination means users trust wrong information. The F1 score, the harmonic mean of precision and recall, combines both into one number.

Confusion matrix for Hallucination detection
      |                          | Predicted Hallucination | Predicted Not Hallucination |
      |--------------------------|-------------------------|-----------------------------|
      | Actual Hallucination     | True Positive (TP)      | False Negative (FN)         |
      | Actual Not Hallucination | False Positive (FP)     | True Negative (TN)          |

      TP: Model correctly flagged hallucination
      FP: Model flagged correct info as hallucination
      FN: Model missed a hallucination
      TN: Model correctly identified truthful info
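
The four cell counts above are all you need to compute precision, recall, and F1. A minimal sketch, using made-up illustrative counts (not real evaluation results):

```python
# Precision, recall, and F1 from confusion-matrix counts.
# The counts below are assumed for illustration only.
tp, fp, fn, tn = 40, 10, 20, 930

precision = tp / (tp + fp)   # of everything flagged, how much was truly a hallucination
recall = tp / (tp + fn)      # of all real hallucinations, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that the 930 true negatives barely matter here: precision, recall, and F1 deliberately ignore TN, which is what makes them useful on imbalanced data.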
    
Precision vs Recall tradeoff with examples

If we focus on high precision, the model rarely calls something a hallucination unless very sure. This means fewer false alarms but might miss some hallucinations (lower recall). This is good if false alarms confuse users.

If we focus on high recall, the model catches almost all hallucinations but may wrongly flag some true info (lower precision). This is better when missing any hallucination is risky, like in medical advice.

Choosing depends on what is worse: missing hallucinations or wrongly warning users.
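
In practice this choice often comes down to a confidence threshold on the detector's score. A sketch with made-up scores and labels (1 = real hallucination) shows how raising the threshold trades recall for precision:

```python
# Precision/recall tradeoff via a decision threshold.
# Scores and labels are illustrative assumptions: a higher score means
# the detector is more confident the output is a hallucination.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    1,    0,    1,    0,    1,    0,    0,    0]

def precision_recall(threshold):
    """Flag everything at or above the threshold, then score the flags."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.85, 0.50, 0.15):
    p, r = precision_recall(t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

At threshold 0.85 the detector is very sure before flagging (precision 1.0, recall 0.4); at 0.15 it catches every hallucination but with more false alarms (precision 0.625, recall 1.0).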

What good vs bad metric values look like

Good: Precision and recall both above 0.8 means the model finds most hallucinations and rarely flags true info by mistake. F1 score near 0.85 or higher shows balanced performance.

Bad: Precision below 0.5 means many false alarms, annoying users. Recall below 0.5 means many hallucinations missed, risking trust. F1 score below 0.6 shows poor detection.

Common pitfalls in Hallucination detection metrics
  • Accuracy paradox: If hallucinations are rare, a model that always says "no hallucination" can have high accuracy but is useless.
  • Data leakage: If test data is too similar to training, metrics look better than real life.
  • Overfitting: Model may detect hallucinations only in training style, failing on new types.
  • Ignoring class imbalance: Hallucinations are often rare, so metrics like accuracy mislead.
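
The accuracy paradox above is easy to see with numbers. A sketch, assuming a 2% hallucination rate (illustrative counts, not real data):

```python
# Accuracy paradox: a "detector" that never flags anything still scores
# high accuracy on imbalanced data. Counts are assumed for illustration.
n_total = 1000
n_hallucinations = 20  # assumed 2% hallucination rate

# The always-"no hallucination" baseline: TP = 0, FP = 0.
tn = n_total - n_hallucinations
accuracy = tn / n_total          # looks great
recall = 0 / n_hallucinations    # catches nothing

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy comes out at 0.98 while recall is 0.0, which is exactly why accuracy alone is the wrong headline metric here.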
Self-check question

Your hallucination detection model has 98% accuracy but only 12% recall on hallucinations. Is it good for production? Why or why not?

Answer: No, it is not good. The model misses 88% of hallucinations (low recall), so it fails to warn users about most wrong info. High accuracy is misleading because hallucinations are rare. Improving recall is critical.
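
One set of counts consistent with those numbers (assumed for illustration: 10,000 outputs, a 2% hallucination rate) makes the problem concrete:

```python
# A confusion matrix consistent with the self-check scenario.
# Counts are assumed for illustration, not taken from a real model.
tp, fn, fp, tn = 24, 176, 24, 9776

accuracy = (tp + tn) / (tp + fn + fp + tn)   # 98% of outputs classified correctly
recall = tp / (tp + fn)                      # only 12% of hallucinations caught
precision = tp / (tp + fp)                   # half the flags are false alarms

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

Despite 98% accuracy, 176 of the 200 hallucinations reach users unflagged, which is what the low recall is telling you.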

Key Result
For hallucination detection, high recall is crucial to catch most false info, balanced with precision to avoid false alarms.