Handling Tool Execution Results in Agentic AI - Model Metrics & Evaluation

When an AI agent executes tools, the two key metrics are accuracy and execution success rate. Accuracy measures how often a tool's output matches the expected result; execution success rate measures how often the tool runs without error. Both matter: a tool that runs cleanly but returns wrong results is useless, and a tool that frequently fails to run is unreliable. Together, they determine how much we can trust the decisions the agent makes from tool outputs.
Imagine a tool that classifies tasks as successful or failed. The confusion matrix looks like this:
|                | Predicted Success  | Predicted Failure  |
|----------------|--------------------|--------------------|
| Actual Success | True Success (TP)  | False Failure (FN) |
| Actual Failure | False Success (FP) | True Failure (TN)  |
Where:
- TP (true positive): the tool reported success and the result was correct.
- FP (false positive): the tool reported success but the result was wrong.
- FN (false negative): the tool reported failure even though the result was actually correct.
- TN (true negative): the tool reported failure and the result was indeed wrong.
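These four counts can be tallied directly from paired (predicted, actual) outcomes. A minimal sketch, with made-up outcome lists for illustration:

```python
# Tally TP/FP/FN/TN for binary success (True) / failure (False) labels.

def confusion_counts(predicted, actual):
    """Return (tp, fp, fn, tn) from paired predicted/actual outcomes."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    return tp, fp, fn, tn

# Illustrative run of six tool executions:
predicted = [True, True, False, True, False, False]
actual    = [True, False, True, True, False, False]
print(confusion_counts(predicted, actual))  # (2, 1, 1, 2)
```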
From this, we calculate precision and recall to understand tool reliability.
Precision measures how many of the executions the tool reported as successful were actually correct: precision = TP / (TP + FP). High precision means few false successes.
Recall measures how many of the actually correct results the tool reported as successful: recall = TP / (TP + FN). High recall means the tool rarely misses correct executions.
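Both metrics follow directly from the confusion-matrix counts. A minimal sketch using the standard formulas (the counts below are illustrative):

```python
# precision = TP / (TP + FP); recall = TP / (TP + FN).
# Guard against empty denominators (no predicted or actual successes).

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

tp, fp, fn = 90, 10, 30        # illustrative counts
print(precision(tp, fp))       # 0.9
print(recall(tp, fn))          # 0.75
```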
Example:
- If the AI agent controls a medical tool, high recall is critical to not miss any correct diagnoses.
- If the AI agent controls a financial transaction tool, high precision is important to avoid false successful transactions.
Balancing precision and recall depends on the tool's purpose and risk tolerance.
Good metrics:
- Accuracy above 90% means the tool usually gives correct results.
- Precision and recall both above 85% show balanced and reliable execution.
- Low failure rate (less than 5%) means the tool rarely crashes or errors.
Bad metrics:
- Accuracy below 70% means many wrong results.
- Precision very low (e.g., 50%) means many false successful executions.
- Recall very low (e.g., 40%) means many missed correct executions.
- High failure rate (above 20%) means the tool is unstable.
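The rule-of-thumb thresholds above can be turned into a simple production gate. A minimal sketch, where the metric names and the idea of a single gating function are assumptions for illustration:

```python
# Gate a tool for production against the rule-of-thumb thresholds above:
# accuracy > 90%, precision and recall > 85%, failure rate < 5%.

def production_ready(metrics):
    """Return (overall_ok, per-check results) for a dict of metric values."""
    checks = {
        "accuracy": metrics["accuracy"] > 0.90,
        "precision": metrics["precision"] > 0.85,
        "recall": metrics["recall"] > 0.85,
        "failure_rate": metrics["failure_rate"] < 0.05,
    }
    return all(checks.values()), checks

ok, checks = production_ready(
    {"accuracy": 0.93, "precision": 0.88, "recall": 0.81, "failure_rate": 0.03}
)
print(ok)  # False: recall is below 0.85, so the tool fails the gate
```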
Common pitfalls:
- Accuracy paradox: If most executions are failures, a tool that always predicts failure can show high accuracy but is useless.
- Data leakage: Using future or test data to evaluate tool results can give overly optimistic metrics.
- Overfitting: Tool tuned too much on training data may fail on new tasks, showing poor real-world metrics.
- Ignoring failure types: Not distinguishing between execution errors and wrong results can hide issues.
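The accuracy paradox is easy to demonstrate numerically. A minimal sketch, using an assumed run where 95% of executions fail and a degenerate "tool" that always reports failure:

```python
# Accuracy paradox: on a heavily imbalanced run, always predicting the
# majority class (failure) scores high accuracy with zero recall.

actual = [False] * 95 + [True] * 5   # 95 failures, 5 successes
predicted = [False] * 100            # always predict failure

accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
tp = sum(p and a for p, a in zip(predicted, actual))
fn = sum(not p and a for p, a in zip(predicted, actual))
recall = tp / (tp + fn)

print(accuracy)  # 0.95 -- looks great
print(recall)    # 0.0  -- catches no successes at all
```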
Your AI agent's tool has 98% accuracy but only 12% recall on successful executions. Is this good for production?
Answer: No. A recall of 12% means the tool misses the vast majority of correct executions, and the 98% accuracy is almost certainly an artifact of class imbalance (the accuracy paradox). Despite the impressive-looking accuracy, the tool fails at its actual task and is not reliable for production.
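One set of counts consistent with these headline numbers makes the problem concrete. The counts below are assumed for illustration: 10,000 executions with only 100 actual successes.

```python
# Assumed counts consistent with 98% accuracy and 12% recall:
tp, fn = 12, 88        # only 12 of 100 actual successes are caught
tn, fp = 9788, 112     # the huge failure class inflates accuracy

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(accuracy)             # 0.98
print(recall)               # 0.12
print(round(precision, 3))  # 0.097 -- most reported successes are wrong too
```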