Agentic AI · ~8 mins

Tool selection by the agent in Agentic AI - Model Metrics & Evaluation

Which metric matters for tool selection by the agent and WHY

When an agent chooses tools to solve tasks, the key metric is the accuracy of the tool's output: how often the chosen tool returns the right answer.

Additionally, latency (speed) matters because slow tools can delay the agent's response.

For some tasks, precision and recall matter if the tool filters or detects specific items. For example, if the agent picks a spam detection tool, high precision avoids false spam labels.

Overall, the agent should select tools that balance accuracy, speed, and task-specific metrics to perform well.
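One way to operationalize this balance is a weighted score over each candidate tool's metrics. The sketch below is illustrative only: the tool names, weights, and metric values are hypothetical assumptions, not part of any real framework.

```python
# Minimal sketch: scoring hypothetical candidate tools on accuracy and latency.
# All names, weights, and values here are illustrative assumptions.
tools = [
    {"name": "tool_a", "accuracy": 0.92, "latency_s": 1.5},
    {"name": "tool_b", "accuracy": 0.88, "latency_s": 0.3},
]

def score(tool, accuracy_weight=0.7, latency_weight=0.3, max_latency_s=5.0):
    """Higher is better: reward accuracy, penalize latency."""
    # Map latency into [0, 1], where 1.0 means instantaneous.
    speed = 1.0 - min(tool["latency_s"], max_latency_s) / max_latency_s
    return accuracy_weight * tool["accuracy"] + latency_weight * speed

best = max(tools, key=score)
print(best["name"])  # tool_b: slightly less accurate, but much faster
```

With these particular weights the faster tool wins despite lower accuracy; shifting `accuracy_weight` upward would flip the choice, which is exactly the task-dependent tradeoff described above.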

Confusion matrix example for tool output

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

Example:
TP = 80, FP = 20, FN = 10, TN = 90
Total samples = 200

From this, the agent can calculate:

  • Precision = TP / (TP + FP) = 80 / (80 + 20) = 0.8
  • Recall = TP / (TP + FN) = 80 / (80 + 10) = 0.89
  • Accuracy = (TP + TN) / Total = (80 + 90) / 200 = 0.85
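These three calculations can be reproduced directly from the worked example's counts:

```python
# Recomputing the worked example: TP=80, FP=20, FN=10, TN=90.
tp, fp, fn, tn = 80, 20, 10, 90
total = tp + fp + fn + tn  # 200

precision = tp / (tp + fp)    # 80 / 100 = 0.8
recall = tp / (tp + fn)       # 80 / 90  ≈ 0.89
accuracy = (tp + tn) / total  # 170 / 200 = 0.85

print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```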

Precision vs Recall tradeoff in tool selection

Imagine the agent must pick a tool to detect fraud:

  • High precision means the tool rarely flags good transactions as fraud (few false alarms).
  • High recall means the tool catches most fraud cases (few missed frauds).

If the agent picks a tool with high precision but low recall, it misses many frauds.

If it picks a tool with high recall but low precision, many good transactions are wrongly flagged.

The agent must balance these based on the task's cost of errors.
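One way to make "cost of errors" concrete is to compare the expected cost of each tool's mistakes. The numbers below are illustrative assumptions (error counts per 1,000 transactions and per-error costs), not measurements from any real system:

```python
# Sketch: comparing two hypothetical fraud tools by expected error cost.
# All counts and costs are illustrative assumptions.
COST_MISSED_FRAUD = 100.0  # cost of one false negative (fraud slips through)
COST_FALSE_ALARM = 1.0     # cost of one false positive (good transaction flagged)

def expected_cost(fn_count, fp_count):
    return fn_count * COST_MISSED_FRAUD + fp_count * COST_FALSE_ALARM

# Per 1,000 transactions containing 50 true frauds:
high_precision_tool = expected_cost(fn_count=30, fp_count=5)   # few alarms, many misses
high_recall_tool = expected_cost(fn_count=5, fp_count=100)     # many alarms, few misses

print(high_precision_tool, high_recall_tool)  # 3005.0 vs 600.0
```

Under these assumed costs the high-recall tool is far cheaper overall, because each missed fraud costs 100 times more than a false alarm; with the cost ratio reversed, the high-precision tool would win.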

Good vs Bad metric values for tool selection

Good metrics:

  • Accuracy above 90% for general tasks
  • Precision and recall above 85% for detection tasks
  • Low latency (fast response)

Bad metrics:

  • Accuracy below 70% means many wrong answers
  • Precision or recall below 50% means poor detection or many false alarms
  • High latency causing slow agent responses

Common pitfalls in evaluating tool selection metrics

  • Accuracy paradox: High accuracy can be misleading if data is imbalanced (e.g., many negatives, few positives).
  • Data leakage: If the tool was tested on data it already saw, metrics are too optimistic.
  • Overfitting: Tool performs well on training data but poorly on new data.
  • Ignoring latency: A very accurate but slow tool may hurt overall agent performance.
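The accuracy paradox is easy to demonstrate with synthetic imbalanced data: a degenerate "tool" that always predicts negative scores 99% accuracy while detecting nothing.

```python
# Accuracy paradox on synthetic imbalanced data:
# always predicting "negative" looks accurate but has zero recall.
labels = [1] * 10 + [0] * 990  # 10 positives, 990 negatives
preds = [0] * 1000             # degenerate tool: always predicts negative

tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
recall = tp / (tp + fn)

print(accuracy, recall)  # 0.99 accuracy, 0.0 recall
```

This is why detection tasks should always report precision and recall alongside accuracy, especially when positives are rare.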

Self-check question

Your agent selects a tool with 98% accuracy but only 12% recall on fraud detection. Is this good for production? Why or why not?

Answer: No, it is not good. Although accuracy is high, the tool misses 88% of fraud cases (low recall). This means many frauds go undetected, which is risky. The agent should pick a tool with higher recall to catch more fraud.

Key Result
Tool selection metrics focus on accuracy, precision, recall, and latency to ensure the agent chooses effective and timely tools.