AutoGen for conversational agents in Agentic AI - Model Metrics & Evaluation

For AutoGen conversational agents, key metrics include accuracy for intent recognition, precision and recall for entity extraction, and F1 score to balance both. These metrics matter because the agent must correctly understand user requests (high recall) and avoid false triggers (high precision) to respond helpfully and naturally.
                  Predicted
                |  Yes |  No
    Actual Yes  |   80 |  20
    Actual No   |   10 |  90
TP = 80 (correctly predicted 'Yes')
FP = 10 (wrongly predicted 'Yes')
FN = 20 (missed 'Yes')
TN = 90 (correctly predicted 'No')
From these counts, precision = TP / (TP + FP) = 80 / 90 ≈ 0.89, recall = TP / (TP + FN) = 80 / 100 = 0.80, and F1 = 2 × precision × recall / (precision + recall) ≈ 0.84.
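The arithmetic above can be checked with a few lines of plain Python, using the TP/FP/FN/TN counts from the worked confusion matrix:

```python
# Confusion-matrix counts from the worked example above.
tp, fp, fn, tn = 80, 10, 20, 90

precision = tp / (tp + fp)                              # 80 / 90  ≈ 0.89
recall    = tp / (tp + fn)                              # 80 / 100 = 0.80
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + fp + fn + tn)             # 170 / 200 = 0.85

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

In production you would typically compute these with a library such as scikit-learn, but spelling them out once makes clear which cell of the matrix feeds which metric.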
Imagine the agent detects when a user wants to book a flight. If precision is high but recall is low, the agent rarely makes mistakes but misses many booking requests, frustrating users. If recall is high but precision is low, the agent tries to book flights too often, annoying users with wrong actions. Balancing precision and recall with F1 score helps the agent respond accurately and reliably.
- Good: Precision and recall above 0.85, F1 score above 0.85, showing balanced and reliable understanding.
- Bad: Precision or recall below 0.5, indicating many false alarms or missed intents, leading to poor user experience.
- Accuracy paradox: high accuracy can be misleading when one intent dominates the data.
- Data leakage: training on future conversation turns artificially inflates metrics.
- Overfitting: very high training metrics but poor performance on real user traffic.
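The accuracy paradox is easy to demonstrate. The sketch below uses hypothetical numbers (950 "other" turns vs. 50 "book_flight" turns, not from the source) and a degenerate agent that never detects the booking intent:

```python
# Hypothetical imbalanced intent dataset: 50 booking turns, 950 other turns.
labels = ["book_flight"] * 50 + ["other"] * 950

# A degenerate agent that always predicts the majority class.
preds = ["other"] * len(labels)

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

tp = sum(p == y == "book_flight" for p, y in zip(preds, labels))
fn = sum(y == "book_flight" and p != "book_flight"
         for p, y in zip(preds, labels))
recall = tp / (tp + fn)

# 95% accuracy, yet 0% recall on the intent we actually care about.
print(f"accuracy={accuracy:.2f} booking recall={recall:.2f}")
```

This is why per-intent precision/recall, not overall accuracy, should gate a conversational agent's release.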
Your conversational agent has 98% accuracy but only 12% recall on booking requests. Is it good for production? Why or why not?
Answer: No. With 12% recall the agent misses almost nine in ten booking requests, so it fails users on the task that matters; the 98% accuracy is inflated by the dominant non-booking class (the accuracy paradox above).