When building custom AI tools, the key metric depends on the tool's goal. For example, if the tool classifies text, accuracy shows overall correctness. But if the tool detects rare events, recall is vital to catch as many true cases as possible. If the tool must avoid false alarms, precision matters more. Choosing the right metric helps you know if your tool works well for its purpose.
Building custom tools in Agentic AI - Model Metrics & Evaluation
Which metric matters for Building custom tools and WHY
Confusion matrix example
| | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | True Positive (TP): 40 | False Negative (FN): 10 |
| Actual Negative | False Positive (FP): 5 | True Negative (TN): 45 |
Total samples = 40 + 10 + 5 + 45 = 100
Precision = TP / (TP + FP) = 40 / (40 + 5) ≈ 0.89
Recall = TP / (TP + FN) = 40 / (40 + 10) = 0.80
Accuracy = (TP + TN) / Total = (40 + 45) / 100 = 0.85
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.89 * 0.80) / (0.89 + 0.80) ≈ 0.84
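The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name classification_metrics is illustrative and not part of any library, and it assumes none of the denominators is zero.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute basic metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (true for the example counts).
    """
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / total
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Counts taken from the confusion matrix in this section.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# precision: 0.89
# recall: 0.80
# accuracy: 0.85
# f1: 0.84
```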
Precision vs Recall tradeoff with examples
Imagine building a custom tool to detect spam emails:
- High Precision: The tool marks emails as spam only when very sure. Few good emails get wrongly marked (few false alarms). But it might miss some spam (lower recall).
- High Recall: The tool catches almost all spam emails. But it might mark some good emails as spam (more false alarms).
Choosing between precision and recall depends on what is worse: missing spam or wrongly blocking good emails.
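One common way this tradeoff appears in practice is through a decision threshold on a spam score. The sketch below uses made-up scores and labels (hypothetical data, not from any real filter) to show how moving the threshold trades precision against recall.

```python
# Hypothetical spam scores (0 = surely fine, 1 = surely spam) with true labels.
emails = [
    (0.95, True), (0.90, True), (0.80, True), (0.65, False),
    (0.60, True), (0.40, True), (0.35, False), (0.20, False),
    (0.10, False), (0.05, False),
]


def precision_recall(threshold):
    """Mark an email as spam when its score is at or above the threshold."""
    tp = sum(1 for score, is_spam in emails if score >= threshold and is_spam)
    fp = sum(1 for score, is_spam in emails if score >= threshold and not is_spam)
    fn = sum(1 for score, is_spam in emails if score < threshold and is_spam)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


strict = precision_recall(0.75)   # flag only when very sure
lenient = precision_recall(0.30)  # flag anything suspicious

print(f"strict:  precision={strict[0]:.2f}, recall={strict[1]:.2f}")
print(f"lenient: precision={lenient[0]:.2f}, recall={lenient[1]:.2f}")
# strict:  precision=1.00, recall=0.60
# lenient: precision=0.71, recall=1.00
```

The strict threshold never blocks a good email but lets two spam emails through; the lenient one catches every spam email at the cost of two false alarms.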
What "good" vs "bad" metric values look like for Building custom tools
Good metrics:
- Accuracy above 85% for balanced tasks
- Precision and recall both above 80% for critical detection tools
- F1 score close to 1 means precision and recall are both high
Bad metrics:
- Accuracy near 50% on balanced two-class data means the tool is doing no better than guessing
- Precision very low (e.g., 30%) means many false alarms
- Recall very low (e.g., 20%) means many misses
- A big gap between precision and recall means the tool is tuned too far toward one kind of error
Common pitfalls when evaluating custom tools
- Accuracy paradox: High accuracy can be misleading if data is imbalanced (e.g., 95% accuracy but only detecting the majority class).
- Data leakage: Using future or test data during training inflates metrics falsely.
- Overfitting: Very high training accuracy but low test accuracy means the tool learned noise, not real patterns.
- Ignoring metric tradeoffs: Focusing only on accuracy without considering precision or recall can hide problems.
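The accuracy paradox is easy to reproduce. The sketch below assumes a hypothetical dataset with 95 negatives and 5 positives and a "tool" that always predicts the majority class.

```python
# Hypothetical imbalanced dataset: 95 negatives (0) and 5 positives (1).
labels = [0] * 95 + [1] * 5

# A "tool" that ignores its input and always predicts the majority class.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.95
print(f"recall: {recall:.2f}")      # recall: 0.00
```

Accuracy looks strong even though the tool never finds a single positive case, which is why recall (or precision) should be reported alongside it on imbalanced data.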
Self-check question
Your custom tool has 98% accuracy but only 12% recall on detecting fraud cases. Is it good for production? Why or why not?
Answer: No, it is not good. The tool misses 88% of fraud cases (low recall), which is dangerous. High accuracy likely comes from many non-fraud cases being correctly identified, but the tool fails at its main goal: catching fraud.
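To make the answer concrete, here is one set of counts that produces exactly these numbers. The counts are an illustrative assumption; the question does not give the underlying data.

```python
# Hypothetical counts: 1,100 transactions, 25 of them fraud.
tp, fn = 3, 22     # fraud caught vs. fraud missed
fp, tn = 0, 1075   # legitimate flagged vs. legitimate passed

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)
missed = fn / (tp + fn)

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.98
print(f"recall: {recall:.2f}")      # recall: 0.12
print(f"missed: {missed:.2f}")      # missed: 0.88
```

Almost all of the accuracy comes from the 1,075 legitimate transactions; only 3 of the 25 fraud cases are caught.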
Key Result
Choosing the right metric like precision or recall is key to knowing if your custom AI tool works well for its specific task.
Practice
1. What is the main purpose of building custom tools for an AI agent?
easy
Solution
Step 1: Understand what custom tools do
Custom tools add new abilities or skills to an AI, making it better at certain jobs.
Step 2: Compare options to the purpose
Only "To add special skills that help the AI perform specific tasks" talks about adding special skills, which matches the purpose of custom tools.
Final Answer:
To add special skills that help the AI perform specific tasks -> Option B
Quick Check:
Custom tools = add special skills [OK]
Hint: Custom tools add new skills to AI for tasks [OK]
Common Mistakes:
- Thinking custom tools speed up AI generally
- Confusing tool purpose with model size
- Assuming tools change AI language automatically
2. Which of the following is the correct way to define a custom tool in Python for an AI agent?
easy
Solution
Step 1: Recall required fields for a custom tool
A custom tool needs a name, a description, and a function to work properly.
Step 2: Check which option includes all three
Only tool = Tool(name='search', description='Find info', func=search_function) sets the name, description, and func parameters correctly.
Final Answer:
tool = Tool(name='search', description='Find info', func=search_function) -> Option D
Quick Check:
Tool needs name, description, and func [OK]
Hint: Include name, description, and func when defining tools [OK]
Common Mistakes:
- Omitting description or name
- Passing parameters in wrong order
- Using wrong parameter names
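The questions use a Tool class without defining it. The sketch below is a hypothetical stand-in written as a dataclass so the examples can run; it is not the API of any specific agent framework, and search_function is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    """Hypothetical stand-in for an agent framework's tool wrapper."""
    name: str                    # identifier the agent uses to pick the tool
    description: str             # tells the agent when the tool is useful
    func: Callable[..., object]  # the code that runs when the tool is called


def search_function(query: str) -> str:
    # Placeholder: a real tool would query a search backend here.
    return f"results for: {query}"


tool = Tool(name='search', description='Find info', func=search_function)
print(tool.func('custom tools'))  # results for: custom tools
```

With this stand-in, omitting any of the three fields raises a TypeError when the tool is constructed, which mirrors the first common mistake listed above.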
3. Given this Python code for a custom tool, what will be the output when calling tool.func('hello')?

def shout(text):
    return text.upper() + '!!!'

tool = Tool(name='shout', description='Make text loud', func=shout)

medium
Solution
Step 1: Understand the function behavior
The function shout converts text to uppercase and adds three exclamation marks.
Step 2: Apply the function to 'hello'
Calling shout('hello') returns 'HELLO!!!'. Since tool.func points to shout, tool.func('hello') does the same.
Final Answer:
'HELLO!!!' -> Option A
Quick Check:
shout('hello') = 'HELLO!!!' [OK]
Hint: Check function logic and apply input to predict output [OK]
Common Mistakes:
- Ignoring uppercase conversion
- Missing exclamation marks
- Assuming func is not callable
4. You wrote this custom tool but get an error when using it. What is the likely problem?

def add_numbers(a, b):
    return a + b

tool = Tool(name='adder', description='Add two numbers', func=add_numbers)
result = tool.func(5)

medium
Solution
Step 1: Check function parameters
add_numbers requires two inputs: a and b.
Step 2: Check how tool.func is called
tool.func(5) provides only one argument, causing an error for missing the second argument.
Final Answer:
Missing one argument when calling tool.func -> Option C
Quick Check:
Function needs 2 args, only 1 given [OK]
Hint: Match function parameters with call arguments [OK]
Common Mistakes:
- Ignoring function argument count
- Thinking description length causes error
- Assuming tool name uniqueness causes runtime error
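The failure can be reproduced without any framework. This sketch assumes that tool.func is just a reference to add_numbers, so it calls the function through a plain variable.

```python
def add_numbers(a, b):
    return a + b


# Assumption: the tool only stores the function, so normal call rules apply.
func = add_numbers

try:
    func(5)  # only one of the two required arguments
except TypeError as error:
    message = str(error)
    print(message)

result = func(5, 3)  # both arguments supplied
print(result)  # 8
```

The TypeError names the missing parameter (b), which points directly at the call site rather than at the tool's name or description.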
5. You want to build a custom tool that summarizes text by returning the first 10 words. Which code correctly defines this tool's function?
hard
Solution
Step 1: Understand the goal of the function
The function should return the first 10 words, not characters or last words.
Step 2: Analyze each option
- def summarize(text): return ' '.join(text.split()[:10]) splits text into words and joins the first 10 words correctly.
- def summarize(text): return text[:10] returns the first 10 characters, not words.
- def summarize(text): return text.split()[-10:] returns the last 10 words.
- def summarize(text): return len(text.split()) returns the word count, not a summary.
Final Answer:
def summarize(text): return ' '.join(text.split()[:10]) -> Option A
Quick Check:
First 10 words = def summarize(text): return ' '.join(text.split()[:10]) [OK]
Hint: Split text and join first 10 words for summary [OK]
Common Mistakes:
- Returning characters instead of words
- Taking last words instead of first
- Returning word count instead of summary
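A quick run shows the difference between slicing words and slicing characters. The sample sentence is made up for illustration.

```python
def summarize(text):
    """Return the first 10 whitespace-separated words of text."""
    return ' '.join(text.split()[:10])


sample = "one two three four five six seven eight nine ten eleven twelve"

print(summarize(sample))       # one two three four five six seven eight nine ten
print(summarize("too short"))  # too short
print(sample[:10])             # one two th  (characters, not words)
```

Slicing a list past its end is safe in Python, so text with fewer than 10 words is returned unchanged.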
