For tool-using agents, the key metrics are task success rate and tool invocation accuracy. Task success rate is the fraction of tasks in which the agent reaches the goal correctly while using tools. Tool invocation accuracy measures whether the agent calls the right tool at the right time. Both matter because the agent must choose the correct tool and then use it properly to solve the problem.
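As a rough illustration of the two metrics, the sketch below computes them from logged episodes. The `Step` and `Episode` records and the idea of comparing each call against an `expected_tool` from a reference trace are assumptions made for this example, not a standard logging format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    # Hypothetical record: the tool a reference trace expects at this step
    # (None means "no tool") and the tool the agent actually called.
    expected_tool: Optional[str]
    called_tool: Optional[str]


@dataclass
class Episode:
    succeeded: bool      # did the agent complete the task correctly?
    steps: List[Step]


def task_success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the task was completed correctly."""
    if not episodes:
        return 0.0
    return sum(e.succeeded for e in episodes) / len(episodes)


def tool_invocation_accuracy(episodes: List[Episode]) -> float:
    """Fraction of steps where the agent's tool choice matched the reference."""
    steps = [s for e in episodes for s in e.steps]
    if not steps:
        return 0.0
    return sum(s.called_tool == s.expected_tool for s in steps) / len(steps)


# Made-up data: two episodes, four steps in total.
episodes = [
    Episode(True, [Step("search", "search"), Step(None, None)]),
    Episode(False, [Step("calculator", "search"), Step(None, "search")]),
]
print(task_success_rate(episodes))         # 0.5
print(tool_invocation_accuracy(episodes))  # 0.5
```

Scoring per step rather than per episode is one possible design choice; it keeps a single wrong call from hiding many correct ones.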
| Predicted Tool Use | Meaning |
|--------------------|---------|
| True Positive (TP) | Agent correctly uses the tool when needed |
| False Positive (FP) | Agent uses tool when not needed |
| True Negative (TN) | Agent correctly does not use tool when not needed |
| False Negative (FN) | Agent fails to use tool when needed |
Total samples = TP + FP + TN + FN
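The four outcomes can be tallied mechanically once each decision point is labeled. The following is a minimal sketch, assuming every record is a hypothetical `(tool_needed, tool_used)` pair of booleans produced by some labeling step that is not shown here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def confusion_counts(records: Iterable[Tuple[bool, bool]]) -> Dict[str, int]:
    """Count TP, FP, TN and FN from (tool_needed, tool_used) pairs."""
    counts: Counter = Counter()
    for needed, used in records:
        if needed and used:
            counts["TP"] += 1   # used the tool when needed
        elif not needed and used:
            counts["FP"] += 1   # used the tool when not needed
        elif not needed and not used:
            counts["TN"] += 1   # correctly skipped the tool
        else:
            counts["FN"] += 1   # failed to use the tool when needed
    return {key: counts[key] for key in ("TP", "FP", "TN", "FN")}


# Made-up labels for five decision points.
records = [(True, True), (True, False), (False, False), (False, True), (True, True)]
counts = confusion_counts(records)
print(counts)                # {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
print(sum(counts.values()))  # 5, the total number of samples
```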
Precision is the fraction of the agent's tool calls that were actually needed: TP / (TP + FP). High precision avoids wasting resources on unnecessary or wrong tool calls.
Recall is the fraction of the cases needing a tool in which the agent actually used it: TP / (TP + FN). High recall avoids missing important tool uses.
Example: For a cooking assistant agent, high precision means it rarely uses the blender when the recipe does not call for it. High recall means it uses the blender nearly every time the recipe does call for it.
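The two quantities can be computed directly from the confusion counts. This is a small sketch using the standard definitions; the blender counts are invented purely for illustration.

```python
def precision(tp: int, fp: int) -> float:
    """Of all tool calls the agent made, the fraction that were needed."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Of all cases that needed the tool, the fraction where it was used."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical cooking-assistant log: the blender was used 10 times,
# 8 of them when the recipe called for it (TP=8, FP=2), and it was
# skipped 8 times when the recipe did call for it (FN=8).
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=8))     # 0.5
```

In this made-up log the agent is fairly careful about when it blends, yet it still misses half of the recipes that need blending, which is why the two numbers are reported separately.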
Good: Task success rate above 90%, tool invocation precision and recall above 85%. This means the agent reliably uses tools correctly and completes tasks.
Bad: Task success rate below 60%, or precision or recall below 50%. The agent often misuses tools or fails to use them when needed, leading to failed tasks.
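These rule-of-thumb bands can be turned into a simple check. The sketch below is only one reading of them: it assumes that any single metric in the "bad" range is enough to fail, and it labels everything between the two bands "borderline", a name the text above does not define.

```python
def rate_agent(task_success: float, precision: float, recall: float) -> str:
    """Classify an agent using the rule-of-thumb thresholds above.

    All inputs are fractions in [0, 1]. The 'borderline' label for the
    middle band is an assumption made for this example.
    """
    if task_success < 0.60 or precision < 0.50 or recall < 0.50:
        return "bad"
    if task_success > 0.90 and precision > 0.85 and recall > 0.85:
        return "good"
    return "borderline"


print(rate_agent(0.95, 0.90, 0.90))  # good
print(rate_agent(0.80, 0.70, 0.70))  # borderline
print(rate_agent(0.98, 0.90, 0.12))  # bad: recall is far below 50%
```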
- Accuracy paradox: High overall accuracy can hide poor tool use if most tasks don't require tools.
- Data leakage: Testing on tasks seen during training inflates success rates.
- Overfitting: Agent memorizes tool use patterns but fails on new tasks.
- Ignoring timing: Using the right tool too late can still cause task failure but may not be captured by simple metrics.
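The accuracy paradox from the first pitfall is easy to reproduce with made-up numbers. The sketch below assumes a hypothetical test set of 1,000 tasks, only 50 of which need a tool, and an agent that never calls a tool at all.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all tool-use decisions that were correct."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical agent that never uses a tool:
# the 950 tool-free tasks are all true negatives,
# the 50 tool-requiring tasks are all false negatives.
tp, fp, tn, fn = 0, 0, 950, 50

acc = accuracy(tp, fp, tn, fn)
rec = tp / (tp + fn) if (tp + fn) else 0.0

print(acc)  # 0.95, which looks strong
print(rec)  # 0.0, the agent never uses the tool when it is needed
```

Reporting recall alongside accuracy, or evaluating the tool-requiring tasks as their own slice, would expose the problem that the single accuracy figure hides.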
Your tool-using agent has 98% task success rate but only 12% recall on tool invocation. Is it good for production? Why or why not?
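Answer: No, it is not good. A recall of 12% means the agent skips the tool in roughly 88% of the cases where it is needed. The 98% task success rate most likely reflects a test set dominated by simple tasks that need no tools, so the agent may still fail on complex tasks that do require them. High task success alone can be misleading.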