
Reflection and self-critique pattern in Agentic AI - Model Metrics & Evaluation

Metrics & Evaluation - Reflection and self-critique pattern
Which metrics matter for the Reflection and self-critique pattern, and why

The Reflection and self-critique pattern improves AI agents by having them evaluate their own outputs and decisions. Key metrics include accuracy to measure overall correctness, precision and recall to distinguish the types of errors being made, and F1 score to balance the two. These metrics tell the agent where it makes mistakes and how to improve; without them, self-critique would lack clear guidance.

Confusion matrix example
      Actual \ Predicted | Positive | Negative
      -------------------|----------|---------
      Positive           |    80    |   20
      Negative           |    10    |   90

This matrix shows the agent's decisions: 80 true positives (correct), 20 false negatives (missed), 10 false positives (wrongly flagged), and 90 true negatives (correctly ignored). The agent uses this to reflect on errors.
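The metrics can be derived directly from these four counts. A minimal sketch, using the numbers from the matrix above:

```python
# Counts taken from the confusion matrix above.
tp, fn = 80, 20   # actual positives: predicted positive / negative
fp, tn = 10, 90   # actual negatives: predicted positive / negative

accuracy = (tp + tn) / (tp + tn + fp + fn)                # 0.85
precision = tp / (tp + fp)                                # ~0.889
recall = tp / (tp + fn)                                   # 0.80
f1 = 2 * precision * recall / (precision + recall)        # ~0.842

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Here all four metrics are healthy and close together, which is what the "good" values described later look like in practice.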

Precision vs Recall tradeoff with examples

Reflection helps balance precision and recall. For example, a medical AI must have high recall to catch all diseases (few misses), even if precision drops (some false alarms). A spam filter AI needs high precision to avoid marking good emails as spam, even if some spam slips through (lower recall). Self-critique guides the agent to adjust this balance based on goals.
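One concrete lever for this balance is the decision threshold: raising it makes the agent more conservative (higher precision, lower recall), lowering it does the opposite. A minimal sketch, using hypothetical confidence scores and labels:

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when flagging scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model confidence scores and true labels (1 = positive).
scores = [0.95, 0.90, 0.85, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

for t in (0.5, 0.8):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

With these numbers, moving the threshold from 0.5 to 0.8 raises precision (0.60 to 0.67) while recall drops (0.75 to 0.50), illustrating the tradeoff a self-critiquing agent must reason about.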

What "good" vs "bad" metric values look like

Good: High accuracy (e.g., 90%+), balanced precision and recall (both above 80%), and F1 score close to 1. This means the agent correctly identifies most cases and makes few mistakes.

Bad: High accuracy but very low recall (e.g., 10%), meaning the agent misses many true cases. Or very low precision, causing many false alarms. These show poor self-critique and need improvement.

Common pitfalls in metrics for Reflection and self-critique
  • Accuracy paradox: High accuracy can be misleading if data is imbalanced (e.g., 95% accuracy but misses all rare cases).
  • Data leakage: If the agent trains on information from the test set (or from the future), metrics look inflated and will not hold in production.
  • Overfitting indicators: Very high training metrics but poor test metrics show the agent is not generalizing well.
  • Ignoring recall or precision: Focusing on one metric alone can hide serious problems.
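The accuracy paradox from the first bullet is easy to demonstrate. A minimal sketch with an illustrative 5% positive class: a degenerate model that always predicts "negative" still scores 95% accuracy while catching nothing.

```python
# Illustrative imbalanced dataset: 5 positives among 100 examples.
labels = [1] * 5 + [0] * 95
preds = [0] * 100          # degenerate "always negative" model

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
recall = tp / sum(labels)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.95 recall=0.00
```

This is why a self-critique loop that checks only accuracy can conclude everything is fine while the agent misses every rare case.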
Self-check question

Your agent has 98% accuracy but only 12% recall on fraud detection. Is it good for production? Why or why not?

Answer: No, it is not good. The agent misses 88% of fraud cases (low recall), which is dangerous. High accuracy is misleading because fraud is rare. The agent needs better recall to catch fraud effectively.
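To make the cost of 12% recall concrete, here is a back-of-the-envelope sketch with hypothetical volumes (a 2% fraud rate over 10,000 transactions is an assumption for illustration):

```python
# Hypothetical volumes: 10,000 transactions, 2% fraud rate.
transactions = 10_000
fraud = int(transactions * 0.02)   # 200 fraudulent transactions
recall = 0.12

caught = int(fraud * recall)       # 24 caught
missed = fraud - caught            # 176 slip through

print(f"fraud={fraud} caught={caught} missed={missed}")
```

Despite 98% accuracy, 176 of 200 fraudulent transactions go undetected, which is why recall, not accuracy, is the metric to critique here.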

Key Result
Reflection and self-critique rely on balanced precision, recall, and F1 to guide AI improvement effectively.