
Measuring agent accuracy and relevance in Agentic AI - Model Metrics & Evaluation

Metrics & Evaluation - Measuring agent accuracy and relevance
Which metrics matter for measuring agent accuracy and relevance, and why

When we measure how well an agent performs, two key ideas matter: accuracy and relevance.

Accuracy tells us how often the agent's answers or actions are correct. It is important because it shows if the agent is reliable.

Relevance shows if the agent's responses fit the user's needs or questions well. Even if an answer is correct, it might not be useful if it is not relevant.

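To measure these, we use metrics like Precision, Recall, and F1 score. Precision is the fraction of the answers the agent gave that were actually correct and relevant. Recall is the fraction of the cases that called for a correct answer in which the agent actually delivered one. F1 score is the harmonic mean of the two, so it is high only when both precision and recall are high.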

For agents, relevance can also be measured by user feedback or similarity scores comparing the agent's output to expected results.
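
A similarity score of the kind mentioned above can be sketched with simple token overlap. This is only a minimal illustration, assuming Jaccard similarity over lowercase word sets as the measure. The function name `jaccard_similarity` and the sample strings are hypothetical, and real evaluations often rely on richer measures such as embedding-based similarity.

```python
def jaccard_similarity(output: str, expected: str) -> float:
    """Share of unique lowercase tokens that the two texts have in common."""
    a = set(output.lower().split())
    b = set(expected.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty texts count as identical
    return len(a & b) / len(a | b)


# Hypothetical agent output and expected result
agent_output = "reset your password in settings"
expected_result = "go to settings to reset your password"

score = jaccard_similarity(agent_output, expected_result)
print(round(score, 2))  # 4 shared tokens out of 7 unique tokens
```

A score near 1 suggests the output covers the same wording as the expected result. Token overlap ignores meaning, so it works best as a rough first filter alongside user feedback.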

Confusion matrix for agent accuracy
      |---------------------------|
      |           | Predicted     |
      | Actual    | Correct | Wrong |
      |-----------|---------|-------|
      | Correct   |   TP    |  FN   |
      | Wrong     |   FP    |  TN   |
      |---------------------------|

      TP = Agent gave correct and relevant answer
      FP = Agent gave answer but it was wrong or irrelevant
      FN = Agent missed giving a correct answer
      TN = Agent correctly did not give an answer when none was needed
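
The four counts above are enough to compute every metric discussed so far. The sketch below uses the standard formulas. The function name `confusion_metrics` and the example counts are made up for illustration, and returning 0.0 for an empty denominator is an assumption rather than a fixed convention.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Guard each division so that empty categories do not raise an error
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts from 100 evaluated interactions
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision is 40/50 and recall is 40/60, so the agent is more careful than it is thorough.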
    
Precision vs Recall tradeoff with examples

Precision is important when we want to avoid wrong answers. For example, a medical advice agent should only give answers it is sure about to avoid harm.

Recall is important when missing a correct answer is costly. For example, a customer support agent should try to answer all user questions, even if some answers are less certain.

Improving precision may lower recall and vice versa. The F1 score helps balance these two.
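
The tradeoff can be seen by varying how confident the agent must be before it answers. The sketch below assumes each candidate answer comes with a confidence score and a label saying whether it would have been correct. Both the data and the function name `precision_recall` are hypothetical.

```python
def precision_recall(items, threshold):
    """Return (precision, recall) if the agent answers only at or above threshold."""
    tp = sum(1 for conf, ok in items if conf >= threshold and ok)
    fp = sum(1 for conf, ok in items if conf >= threshold and not ok)
    fn = sum(1 for conf, ok in items if conf < threshold and ok)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical (confidence, answer_would_be_correct) pairs
candidates = [
    (0.95, True), (0.90, True), (0.85, True), (0.70, False), (0.65, True),
    (0.60, True), (0.40, False), (0.35, True), (0.30, False), (0.20, False),
]

for threshold in (0.8, 0.5, 0.25):
    p, r = precision_recall(candidates, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, the strict threshold gives perfect precision but finds only half of the correct answers, while the loose threshold finds all of them at the cost of more wrong answers.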

What good vs bad metric values look like for agent accuracy and relevance
  • Good: Precision and recall above 0.8 means the agent is mostly correct and finds most relevant answers.
  • Bad: Precision below 0.5 means many wrong answers. Recall below 0.5 means many correct answers are missed.
  • High precision but low recall means the agent is cautious but misses many opportunities to help.
  • High recall but low precision means the agent gives many answers but many are wrong or irrelevant.
Common pitfalls when measuring agent accuracy and relevance
  • Accuracy paradox: If the data is mostly one class (e.g., most inputs need no answer), accuracy can be high even if the agent never answers.
  • Data leakage: Testing the agent on data it has seen before inflates metrics falsely.
  • Overfitting: Agent performs well on training data but poorly on new questions.
  • Ignoring relevance: Measuring only correctness without checking if answers fit the user's intent.
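
As one example of guarding against these pitfalls, a simple data leakage check looks for evaluation questions that also appear in the training set. This is a rough sketch, assuming questions are plain strings and that matching after lowercasing and whitespace cleanup is good enough. The helper names and sample questions are hypothetical, and the check will not catch paraphrased duplicates.

```python
def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def find_leaked(train_questions, eval_questions):
    """Return evaluation questions that also occur in the training questions."""
    seen = {normalize(q) for q in train_questions}
    return [q for q in eval_questions if normalize(q) in seen]


# Hypothetical training and evaluation sets
train = ["How do I reset my password?", "Where is my invoice?"]
evaluation = ["how do I reset my   password?", "Can I change my plan?"]

leaked = find_leaked(train, evaluation)
print(f"{len(leaked)} of {len(evaluation)} evaluation questions were seen in training")
```

Any leaked questions should be removed from the evaluation set before computing metrics.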
Self-check question

Your agent has 98% accuracy but only 12% recall on important user questions. Is it good for production? Why or why not?

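Answer: No, it is not good. With only 12% recall, the agent misses most of the important questions, so it fails to help users. The 98% accuracy is most likely driven by the many cases where no answer was needed, so it says little about how the agent handles the questions that matter. Improving recall is critical.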
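
To see how 98% accuracy and 12% recall can coexist, it helps to plug in counts. The numbers below are hypothetical, chosen only so that both figures come out exactly: 1,100 interactions, of which just 25 are important questions.

```python
# Hypothetical counts: 25 important questions among 1,100 interactions
tp = 3      # important questions answered correctly
fn = 22     # important questions the agent missed
fp = 0      # wrong or irrelevant answers
tn = 1075   # interactions where staying silent was the right call

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2%}, recall={recall:.2%}")
```

Almost all of the accuracy comes from the 1,075 true negatives, while 22 of the 25 important questions go unanswered.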

Key Result
Precision, recall, and F1 score best measure agent accuracy and relevance by balancing correctness and coverage.

Practice

1. What does accuracy measure when evaluating an AI agent's answers?
easy
A. How many answers are related but not exact
B. How fast the agent responds
C. How many answers are exactly correct
D. How many answers are generated

Solution

  1. Step 1: Understand accuracy definition

    Accuracy measures the share of answers that match the correct ones exactly.
  2. Step 2: Compare with other metrics

    Relevance measures usefulness, not exact correctness, so it is different from accuracy.
  3. Final Answer:

    How many answers are exactly correct -> Option C
  4. Quick Check:

    Accuracy = exact correctness [OK]
Hint: Accuracy means exact right answers only [OK]
Common Mistakes:
  • Confusing accuracy with relevance
  • Thinking accuracy measures speed
  • Assuming accuracy counts all related answers
2. Which of the following is the correct way to calculate accuracy for an AI agent's answers?
easy
A. Number of related answers divided by total answers
B. Number of correct answers divided by total answers
C. Number of answers generated per second
D. Number of answers ignored by the agent

Solution

  1. Step 1: Recall accuracy formula

    Accuracy = (correct answers) / (total answers given).
  2. Step 2: Eliminate incorrect options

    Options about related answers or speed do not define accuracy.
  3. Final Answer:

    Number of correct answers divided by total answers -> Option B
  4. Quick Check:

    Accuracy = correct / total [OK]
Hint: Accuracy = correct answers ÷ total answers [OK]
Common Mistakes:
  • Using related answers count instead of correct
  • Mixing speed with accuracy
  • Ignoring total number of answers
3. An AI agent answered 80 questions: 60 answers were exactly correct, and 10 more were relevant but not exact. What are the accuracy and relevance percentages?
medium
A. Accuracy 60%, Relevance 70%
B. Accuracy 60%, Relevance 87.5%
C. Accuracy 75%, Relevance 60%
D. Accuracy 75%, Relevance 87.5%

Solution

  1. Step 1: Calculate accuracy percentage

    Accuracy = (60 correct / 80 total) * 100 = 75%.
  2. Step 2: Calculate relevance percentage

    Relevance = ((60 correct + 10 relevant) / 80 total) * 100 = 87.5%.
  3. Final Answer:

    Accuracy 75%, Relevance 87.5% -> Option D
  4. Quick Check:

    Accuracy = 75%, Relevance = 87.5% [OK]
Hint: Add relevant to correct for relevance % [OK]
Common Mistakes:
  • Mixing accuracy and relevance values
  • Not adding relevant answers for relevance
  • Dividing by wrong total number
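
The calculation in question 3 can be reproduced in a few lines. The variable names are illustrative, and relevance is defined as in the solution above: exactly correct answers plus relevant-but-not-exact answers, divided by the total.

```python
total = 80
exact_correct = 60
relevant_not_exact = 10

# Accuracy counts only exact matches
accuracy_pct = exact_correct / total * 100
# Relevance also counts answers that were useful but not exact
relevance_pct = (exact_correct + relevant_not_exact) / total * 100

print(f"Accuracy: {accuracy_pct}%, Relevance: {relevance_pct}%")
# prints "Accuracy: 75.0%, Relevance: 87.5%"
```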
4. The AI agent evaluation snippet below is meant to calculate accuracy, but it fails when run. What is the bug?
correct = 50
total = 0
accuracy = correct / total
print(accuracy)
medium
A. Division by zero error due to total being zero
B. Correct variable is zero, so accuracy is zero
C. Print statement syntax is wrong
D. Accuracy should be multiplied by 100

Solution

  1. Step 1: Identify variables and operation

    correct = 50, total = 0, accuracy = correct / total.
  2. Step 2: Check for division errors

    Dividing by zero (total = 0) raises a ZeroDivisionError in Python, so no accuracy value is ever printed.
  3. Final Answer:

    Division by zero error due to total being zero -> Option A
  4. Quick Check:

    Division by zero causes error [OK]
Hint: Check denominator is not zero before dividing [OK]
Common Mistakes:
  • Ignoring zero division error
  • Thinking print syntax is wrong
  • Assuming accuracy must be multiplied by 100
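
One way to avoid the bug in question 4 is to check the denominator before dividing. The sketch below is one possible guard, not the only reasonable design. The function name `safe_accuracy` is hypothetical, and returning `None` for an empty evaluation set is an assumption; raising a descriptive error would also be a valid choice.

```python
from typing import Optional


def safe_accuracy(correct: int, total: int) -> Optional[float]:
    """Return correct / total, or None when no answers were evaluated."""
    if total == 0:
        return None  # assumption: "no data" is reported as None, not as 0.0
    return correct / total


print(safe_accuracy(50, 0))    # None: nothing was evaluated
print(safe_accuracy(50, 100))  # 0.5
```

Returning `None` instead of 0.0 keeps "no data" distinct from "every answer was wrong".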
5. You want to improve an AI agent's trust by measuring both accuracy and relevance. Which approach best helps achieve this?
hard
A. Track exact correct answers and also count useful related answers
B. Only count answers that are exactly correct
C. Ignore relevance and focus on speed of answers
D. Count all answers regardless of correctness or relevance

Solution

  1. Step 1: Understand trust factors

    Trust improves when answers are both correct and useful (relevant).
  2. Step 2: Choose measurement approach

    Tracking both exact correctness (accuracy) and usefulness (relevance) gives a fuller picture.
  3. Final Answer:

    Track exact correct answers and also count useful related answers -> Option A
  4. Quick Check:

    Measure accuracy + relevance for trust [OK]
Hint: Measure both exact and useful answers for trust [OK]
Common Mistakes:
  • Focusing only on exact correctness
  • Ignoring relevance completely
  • Measuring speed instead of quality