Agent API Design Patterns in Agentic AI - Model Metrics & Evaluation
When designing Agent APIs, the key metrics are response accuracy, latency, and robustness. Accuracy measures whether the agent returns correct and relevant answers. Latency measures how quickly the agent responds, which is critical for user experience. Robustness ensures the agent handles unexpected inputs without failing. These metrics matter because a useful API must be correct, fast, and reliable in real-world applications.
|                    | Predicted Correct  | Predicted Incorrect |
|--------------------|--------------------|---------------------|
| Actually Correct   | True Positive (TP) | False Negative (FN) |
| Actually Incorrect | False Positive (FP)| True Negative (TN)  |
Example (100 samples total):
- TP = 80 (correct responses)
- FP = 10 (incorrect responses accepted as correct)
- FN = 5 (correct responses missed)
- TN = 5 (wrong inputs correctly rejected)
This matrix helps measure precision and recall of the agent's responses.
Precision is the fraction of the agent's responses that were actually correct: TP / (TP + FP). High precision means fewer wrong answers.
Recall is the fraction of all possible correct answers the agent found: TP / (TP + FN). High recall means the agent misses fewer correct answers.
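These two formulas can be computed directly from the example counts above (a minimal sketch; the function name is illustrative, not part of any library):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of the answers given, how many were correct
    recall = tp / (tp + fn)     # of all correct answers, how many were found
    return precision, recall

# Using the example: TP = 80, FP = 10, FN = 5
p, r = precision_recall(tp=80, fp=10, fn=5)
print(f"precision={p:.3f}, recall={r:.3f}")  # precision=0.889, recall=0.941
```

Note that TN does not appear in either formula: precision and recall only describe how the agent handles positive cases.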
For example, a customer support agent API should have high precision to avoid giving wrong advice. But a research assistant agent API should have high recall to find as many relevant facts as possible, even if some are less precise.
- Good: precision > 0.9, recall > 0.85, latency < 1 second, and the agent handles 99% of unexpected inputs without failure.
- Bad: precision < 0.6 (many wrong answers), recall < 0.5 (misses many correct answers), latency > 5 seconds (slow responses), or frequent crashes and errors on unusual inputs.
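One way to put these thresholds to work is as an automated release gate. The sketch below assumes a simple metrics record; the names (`AgentMetrics`, `meets_production_bar`) are hypothetical, not from any framework:

```python
from dataclasses import dataclass

@dataclass
class AgentMetrics:
    precision: float
    recall: float
    latency_s: float    # response latency in seconds
    robustness: float   # fraction of unexpected inputs handled without failure

def meets_production_bar(m: AgentMetrics) -> bool:
    """Check measured metrics against the 'good' thresholds listed above."""
    return (m.precision > 0.9
            and m.recall > 0.85
            and m.latency_s < 1.0
            and m.robustness >= 0.99)

good = AgentMetrics(precision=0.92, recall=0.88, latency_s=0.4, robustness=0.995)
bad = AgentMetrics(precision=0.55, recall=0.45, latency_s=6.0, robustness=0.80)
print(meets_production_bar(good), meets_production_bar(bad))  # True False
```

A gate like this makes the "good" criteria explicit and testable rather than leaving them as a checklist.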
- Accuracy paradox: High overall accuracy can hide poor performance on rare but important queries.
- Data leakage: Training on data too similar to test data inflates metrics falsely.
- Overfitting: Agent performs well on training queries but poorly on new, real-world inputs.
- Ignoring latency: A very accurate agent that responds too slowly harms user experience.
- Not measuring robustness: Failing to test how the agent handles unexpected or malformed inputs.
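The last pitfall, unmeasured robustness, can be addressed with a simple harness that feeds malformed inputs and counts graceful handling. This is a minimal sketch with a toy agent; the helper names are hypothetical:

```python
def safe_call(agent, raw_input: str) -> bool:
    """Return True if the agent handles the input without raising."""
    try:
        agent(raw_input)
        return True
    except Exception:
        return False

def robustness_score(agent, malformed_inputs: list[str]) -> float:
    """Fraction of malformed inputs handled without failure."""
    handled = sum(safe_call(agent, x) for x in malformed_inputs)
    return handled / len(malformed_inputs)

# Toy agent that fails on empty input (illustrative only).
def toy_agent(text: str) -> str:
    if not text:
        raise ValueError("empty input")
    return "ok"

cases = ["", "???", "a" * 10_000, '{"broken json":', "normal question"]
print(robustness_score(toy_agent, cases))  # 0.8
```

In practice the malformed-input set would include truncated payloads, oversized inputs, wrong encodings, and adversarial prompts rather than the toy cases shown here.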
Your agent API has 98% accuracy but only 12% recall on critical queries. Is it good for production? Why or why not?
Answer: No, it is not good. Although accuracy is high, the very low recall means the agent misses most important queries. This can cause serious problems because many correct answers are never found. Improving recall is critical before production use.
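To see how 98% accuracy can coexist with 12% recall, consider one hypothetical distribution of counts that produces exactly these numbers (the counts below are invented for illustration):

```python
# 10,000 queries, of which 200 are critical (positives).
tp, fn = 24, 176        # agent finds only 24 of 200 critical queries
fp, tn = 24, 9776       # non-critical queries are mostly handled correctly

total = tp + fn + fp + tn
accuracy = (tp + tn) / total   # (24 + 9776) / 10000 = 0.98
recall = tp / (tp + fn)        # 24 / 200 = 0.12
print(accuracy, recall)
```

Because non-critical queries dominate the dataset, the 9,776 true negatives inflate accuracy while 176 of the 200 critical queries are missed; this is the accuracy paradox from the pitfalls list.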
