Tool usage (function calling) in Prompt Engineering / GenAI - Model Metrics & Evaluation

When evaluating tools or functions in AI systems, the key metric is the accuracy of function-call results: how often the tool returns the correct or expected output. Response time also matters, since slow calls hurt user experience. For tools that filter or select information, precision and recall matter as well, measuring whether the results returned are relevant and whether important data is missed.
|                | Predicted Success   | Predicted Failure   |
|----------------|---------------------|---------------------|
| Actual Success | True Positive (TP)  | False Negative (FN) |
| Actual Failure | False Positive (FP) | True Negative (TN)  |
- TP: Function call succeeded and the output was correct
- FP: Function call succeeded but the output was wrong
- FN: Function call failed but should have succeeded
- TN: Function call failed and was expected to fail
Metrics like precision = TP / (TP + FP) and recall = TP / (TP + FN) help measure how well the tool works.
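A minimal sketch of these two formulas as code, using made-up counts from a hypothetical evaluation run (the numbers are illustrative, not real data):

```python
# Precision: of the calls that returned a result, how many were correct?
# Recall: of the calls that should have succeeded, how many did?
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical run: 80 correct calls, 10 wrong outputs,
# 20 calls that failed but should have succeeded.
tp, fp, fn = 80, 10, 20
print(f"precision = {precision(tp, fp):.2f}")  # 0.89
print(f"recall    = {recall(tp, fn):.2f}")     # 0.80
```

The guards against division by zero matter in practice: an evaluation batch with no successful calls would otherwise crash the metric computation.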
A highly precise tool is usually correct when it does return a result, but it may miss some correct results (low recall). For example, a search tool that only returns high-confidence answers may omit relevant information.
A high-recall tool finds most of the correct results but may include wrong ones (low precision). For example, a tool that returns many candidate answers, some of which are incorrect.
Whether to favor precision or recall depends on the use case: for critical tasks, high recall avoids missing important information; for user-facing tools, high precision avoids surfacing confusing wrong results.
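One common way this trade-off shows up is a confidence threshold: raising it makes the tool return fewer, surer answers (higher precision, lower recall). A small sketch with invented scores and labels:

```python
# Sweep a confidence threshold over hypothetical tool outputs.
# `scores` are made-up confidence values; `labels` say whether
# each candidate answer was actually correct.
def evaluate(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30]
labels = [True, True, False, True, True, False, True]

for t in (0.5, 0.85):
    p, r = evaluate(scores, labels, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

With these numbers, the lower threshold gives precision 0.80 and recall 0.80, while the higher threshold reaches precision 1.00 at the cost of recall 0.40.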
Good (rough rules of thumb): precision and recall above 90%, a low error rate, and response time under one second.
Bad: precision or recall below 50%, frequent wrong outputs, or slow/failed calls.
Example: A tool with 95% precision but 40% recall misses many correct outputs, which may be bad for completeness.
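One way to quantify how badly the 40% recall drags down that tool: the F1 score (harmonic mean of precision and recall) punishes the imbalance, a standard combination metric rather than anything specific to this example:

```python
# F1: harmonic mean of precision and recall. It is low whenever
# either component is low, unlike a simple average.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if (p + r) else 0.0

print(f"{f1(0.95, 0.40):.2f}")  # 0.56
```

Despite the impressive 95% precision, the combined score is only about 0.56, reflecting the poor completeness.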
- Accuracy paradox: High overall accuracy can hide poor performance on rare but important cases.
- Data leakage: Testing tool on data it already saw inflates metrics.
- Overfitting: Tool works well on test data but fails in real use.
- Ignoring latency: Fast but inaccurate tools or slow but accurate tools may both be problematic.
Your tool usage model has 98% accuracy but only 12% recall on important function calls. Is it good for production? Why or why not?
Answer: No. Despite 98% overall accuracy, 12% recall means the tool misses almost all of the important calls; the high accuracy mostly reflects easy majority cases. This can cause failures in critical tasks.
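The numbers in the question can be reproduced with a small imbalanced-data sketch (all counts below are invented to make the arithmetic match):

```python
# Accuracy paradox: a model that handles the routine majority well
# can reach 98% accuracy while catching almost none of the rare,
# important function calls.
total = 1000
important = 25        # rare but critical function calls
caught = 3            # important calls handled correctly
other_correct = 977   # routine calls handled correctly

accuracy = (caught + other_correct) / total
recall_important = caught / important
print(f"accuracy = {accuracy:.2%}")          # 98.00%
print(f"recall   = {recall_important:.2%}")  # 12.00%
```

This is exactly the accuracy-paradox pitfall listed above: the headline accuracy number hides a 12% recall on the cases that matter most.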