## Reflection and Self-Critique Pattern in Agentic AI: Model Metrics & Evaluation

The Reflection and self-critique pattern improves AI agents by having them evaluate their own outputs and decisions. Key metrics include accuracy to measure overall correctness, precision and recall to distinguish between error types, and the F1 score to balance the two. These metrics show the agent where it makes mistakes and how to improve; without them, self-critique would lack clear guidance.
| Actual \ Predicted | Positive | Negative |
|--------------------|----------|----------|
| Positive           | 80       | 20       |
| Negative           | 10       | 90       |
This matrix shows the agent's decisions: 80 true positives (correct), 20 false negatives (missed), 10 false positives (wrongly flagged), and 90 true negatives (correctly ignored). The agent uses this to reflect on errors.
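All four metrics can be computed directly from these counts. A minimal sketch in Python, using the numbers from the matrix above (the variable names are illustrative):

```python
# Confusion-matrix counts from the table above
tp, fn = 80, 20   # actual positives: correctly caught vs. missed
fp, tn = 10, 90   # actual negatives: wrongly flagged vs. correctly ignored

accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of all decisions that were right
precision = tp / (tp + fp)                          # of everything flagged positive, how much was real
recall = tp / (tp + fn)                             # of everything actually positive, how much was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# → accuracy=0.850 precision=0.889 recall=0.800 f1=0.842
```

Note that precision (0.889) and recall (0.800) differ even though both come from the same matrix: this agent raises few false alarms but misses one in five true cases.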
Reflection helps balance precision and recall. For example, a medical AI must have high recall to catch all diseases (few misses), even if precision drops (some false alarms). A spam filter AI needs high precision to avoid marking good emails as spam, even if some spam slips through (lower recall). Self-critique guides the agent to adjust this balance based on goals.
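The usual knob for this balance is the decision threshold on the model's score. A toy sketch (the scores and labels below are invented purely for illustration) showing how a strict threshold favors precision and a lenient one favors recall:

```python
# Toy scored predictions: (model score, true label). Invented data.
examples = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
            (0.40, 1), (0.30, 0), (0.20, 0), (0.10, 1), (0.05, 0)]

def precision_recall(threshold):
    """Flag everything at or above `threshold` as positive, then score it."""
    tp = sum(1 for s, y in examples if s >= threshold and y == 1)
    fp = sum(1 for s, y in examples if s >= threshold and y == 0)
    fn = sum(1 for s, y in examples if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Strict threshold: spam-filter style (high precision, lower recall).
print(precision_recall(0.85))  # → (1.0, 0.4)
# Lenient threshold: medical-screening style (lower precision, high recall).
print(precision_recall(0.35))  # → (0.6666666666666666, 0.8)
```

Self-critique, in this framing, means the agent picks the threshold that matches its goal rather than a default.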
Good: High accuracy (e.g., 90%+), balanced precision and recall (both above 80%), and F1 score close to 1. This means the agent correctly identifies most cases and makes few mistakes.
Bad: High accuracy but very low recall (e.g., 10%), meaning the agent misses most true cases; or very low precision, causing many false alarms. Either pattern indicates that self-critique is failing and the agent needs improvement.
- Accuracy paradox: High accuracy can be misleading if data is imbalanced (e.g., 95% accuracy but misses all rare cases).
- Data leakage: If the agent learns from future or test data, metrics look better but are not real.
- Overfitting indicators: Very high training metrics but poor test metrics show the agent is not generalizing well.
- Ignoring recall or precision: Focusing on one metric alone can hide serious problems.
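The accuracy paradox from the first pitfall is easy to demonstrate. A small sketch with synthetic counts (assumed for illustration): a "classifier" that always predicts negative on imbalanced data scores high accuracy while catching nothing.

```python
# Synthetic imbalanced dataset: 1000 cases, only 50 (5%) are positive.
n_positive, n_negative = 50, 950

# A lazy agent that always predicts negative:
tp, fn = 0, n_positive   # catches none of the positives
fp, tn = 0, n_negative   # but never raises a false alarm

accuracy = (tp + tn) / (n_positive + n_negative)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.0%}, recall={recall:.0%}")
# → accuracy=95%, recall=0%
```

An agent reflecting only on accuracy would judge this behavior excellent, which is exactly why self-critique must also track recall and precision.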
Your agent has 98% accuracy but only 12% recall on fraud detection. Is it good for production? Why or why not?
Answer: No, it is not ready. With 12% recall, the agent misses 88% of fraud cases, which is dangerous in production. The high accuracy is misleading because fraud is rare: predicting "not fraud" almost every time still scores well. The agent needs substantially better recall to catch fraud effectively.
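To see how these headline numbers can coexist, here is one set of counts (invented for illustration; the quiz does not specify them) that yields roughly 98% accuracy and 12% recall:

```python
# Hypothetical counts: 10,000 transactions, 100 of them fraudulent (1%).
tp, fn = 12, 88        # only 12 of 100 fraud cases caught → 12% recall
fp, tn = 112, 9788     # legitimate transactions, a few wrongly flagged

accuracy = (tp + tn) / 10_000
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.3f}")
# → accuracy=0.98 recall=0.12 precision=0.097
```

Because 99% of transactions are legitimate, the huge true-negative count dominates accuracy while nearly all fraud slips through.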