Stateful agents built with LangGraph track sequences of actions and states over time. Key metrics include accuracy for correct state predictions, precision and recall for detecting important events or decisions, and F1 score to balance precision and recall. These metrics matter because the agent must remember past states correctly and base its decisions on them. A single wrong state prediction can propagate into wrong actions later.
LangGraph for stateful agents in Agentic AI - Model Metrics & Evaluation
Confusion matrix (rows: actual state, columns: predicted state)

Actual | Pred S1 | Pred S2 | Pred S3 |
--------------------------------------
S1     |      40 |       5 |       3 |
S2     |       4 |      35 |       6 |
S3     |       2 |       7 |      38 |
Total samples = 40+5+3+4+35+6+2+7+38 = 140
This matrix shows how often the agent predicted each state correctly (diagonal) or confused it with others (off-diagonal). From this, we calculate precision and recall per state.
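The per-state calculation can be sketched in a few lines of plain Python. This is a minimal sketch that assumes rows are actual states and columns are predicted states, which the matrix labels suggest but do not state outright. The names `STATES`, `MATRIX`, and `per_state_metrics` are illustrative and not part of LangGraph.

```python
# Counts copied from the confusion matrix above.
# Assumption: rows are actual states, columns are predicted states.
STATES = ["S1", "S2", "S3"]
MATRIX = [
    [40, 5, 3],
    [4, 35, 6],
    [2, 7, 38],
]


def per_state_metrics(matrix, labels):
    """Return overall accuracy and per-state precision, recall, and F1."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: times this state was predicted
        actual = sum(matrix[i])                    # row sum: times this state actually occurred
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return correct / total, metrics


accuracy, metrics = per_state_metrics(MATRIX, STATES)
print(f"accuracy: {accuracy:.3f}")
for state, m in metrics.items():
    print(f"{state}: precision={m['precision']:.3f} "
          f"recall={m['recall']:.3f} f1={m['f1']:.3f}")
```

With these counts, overall accuracy is 113/140 (about 0.81). For S1, precision is 40/46 and recall is 40/48.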
Imagine the agent detects a critical event in the state graph. High precision means when it says the event happened, it really did (few false alarms). High recall means it finds most of the actual events (few misses).
For safety-critical agents, missing an event (low recall) can be dangerous, so recall is prioritized. For agents where false alarms cause costly actions, precision is more important.
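One common way to encode this priority in a single number is the F-beta score, a generalization of F1 that the text above does not name. The sketch below uses hypothetical precision and recall values. The function `f_beta` is illustrative, not a LangGraph API.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical detector: few false alarms (high precision), many misses (lower recall).
precision, recall = 0.9, 0.6

f1 = f_beta(precision, recall)                # balanced
f2 = f_beta(precision, recall, beta=2.0)      # recall-weighted, e.g. safety-critical agents
f_half = f_beta(precision, recall, beta=0.5)  # precision-weighted, e.g. costly false alarms

print(f"F1={f1:.3f}  F2={f2:.3f}  F0.5={f_half:.3f}")
```

For this detector the recall-weighted F2 is lower than F1, and the precision-weighted F0.5 is higher, reflecting that its weak point is missed events rather than false alarms.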
- Good: Accuracy > 90%, Precision and Recall both > 85%, F1 score > 0.85. This means the agent reliably tracks states and detects events.
- Bad: Accuracy < 70%, Precision or Recall < 50%. This means the agent often mispredicts states or misses important events, leading to poor decisions.
- Accuracy paradox: High accuracy can be misleading if some states are very common. The agent might ignore rare but important states.
- Data leakage: If future states leak into training, evaluation metrics become unrealistically high.
- Overfitting: The agent may memorize training sequences but fail on new ones, causing poor real-world performance.
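One way to reduce the leakage risk described above is to split evaluation data by whole episode, so that no trajectory contributes steps to both the training and the test set. The sketch below is one possible approach under assumed data: the `episodes` dictionary, its contents, and the helper `split_by_episode` are all hypothetical.

```python
import random


def split_by_episode(episodes, test_fraction=0.2, seed=0):
    """Split a dict of {episode_id: [steps]} so each episode lands in exactly one set."""
    ids = sorted(episodes)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])
    train = {e: episodes[e] for e in ids if e not in test_ids}
    test = {e: episodes[e] for e in ids if e in test_ids}
    return train, test


# Hypothetical data: 10 episodes, each a short list of (state, action) steps.
episodes = {f"ep{i}": [("S1", "a"), ("S2", "b"), ("S3", "c")] for i in range(10)}
train, test = split_by_episode(episodes)
print(len(train), "training episodes,", len(test), "test episodes")
```

Evaluating on episodes the agent has never seen also gives a more honest signal about overfitting than scoring on held-out steps from familiar trajectories.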
Your LangGraph agent has 98% accuracy but only 12% recall on detecting a critical state change. Is it good for production? Why or why not?
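Answer: No. The high accuracy most likely reflects that critical state changes are rare, so the agent scores well simply by getting the common case right. With 12% recall it misses about 88% of critical state changes, which can cause serious failures because important events go undetected.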
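The numbers in this question can be reproduced with a small worked example. The counts below are hypothetical, chosen only so that accuracy comes out at 98% and recall at 12%. They are not measurements from a real agent.

```python
# Hypothetical evaluation set: 10,000 steps, 200 of which are critical state changes.
tp, fn = 24, 176    # critical changes detected / missed
fp, tn = 24, 9776   # false alarms / correctly ignored normal steps

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)

# Baseline that never reports a critical change: correct on every normal step,
# wrong on every critical one.
baseline_accuracy = (fp + tn) / total

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
print(f"never-alert baseline accuracy={baseline_accuracy:.2f}")
```

In this setup a baseline that never flags anything reaches the same 98% accuracy, which shows why accuracy alone says little about detecting a rare event.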