
Measuring agent accuracy and relevance in Agentic AI - Model Metrics & Evaluation

Which metric matters for measuring agent accuracy and relevance and WHY

When we measure how well an agent performs, two key ideas matter: accuracy and relevance.

Accuracy tells us how often the agent's answers or actions are correct. It matters because it shows whether the agent is reliable.

Relevance shows whether the agent's responses fit the user's needs or questions. Even a correct answer is not useful if it does not address what the user asked.

To measure these, we use metrics like Precision, Recall, and F1 score. Precision tells us how many of the agent's positive answers were truly correct. Recall tells us how many of the true correct answers the agent found. F1 score balances both.

For agents, relevance can also be measured by user feedback or similarity scores comparing the agent's output to expected results.
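One simple way to approximate such a similarity score is token overlap between the agent's output and an expected reference answer. The sketch below uses Jaccard similarity with plain whitespace tokenization; the function name and tokenization choice are illustrative, not a standard API.

```python
def jaccard_similarity(output: str, expected: str) -> float:
    """Token-overlap similarity between the agent's output and an
    expected answer. Returns a score in [0, 1]; 1.0 means the token
    sets are identical."""
    a = set(output.lower().split())
    b = set(expected.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# A response can be factually fine yet score low against what the
# user actually asked for:
print(jaccard_similarity("reset your password in settings",
                         "how do I change my email address"))
```

Real evaluation pipelines usually use embedding-based cosine similarity instead of raw token overlap, but the idea is the same: compare the agent's output to an expected result and score how closely they match.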

Confusion matrix for agent accuracy
      |-----------|---------|-------|
      |           | Predicted       |
      | Actual    | Correct | Wrong |
      |-----------|---------|-------|
      | Correct   |   TP    |  FN   |
      | Wrong     |   FP    |  TN   |
      |-----------|---------|-------|

      TP = Agent gave correct and relevant answer
      FP = Agent gave answer but it was wrong or irrelevant
      FN = Agent missed giving a correct answer
      TN = Agent correctly did not give an answer when none was needed
    
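From the TP, FP, and FN counts in the matrix, precision, recall, and F1 follow directly. The sketch below uses made-up counts from a hypothetical evaluation run:

```python
def precision(tp: int, fp: int) -> float:
    # Of all answers the agent gave, how many were correct and relevant?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Of all questions with a correct answer, how many did the agent find?
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p: float, r: float) -> float:
    # Harmonic mean: punishes a large gap between precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts: 80 correct answers given, 20 wrong/irrelevant
# answers given, 40 correct answers missed.
tp, fp, fn = 80, 20, 40
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f1(p, r))  # 0.8, ~0.667, ~0.727
```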
Precision vs Recall tradeoff with examples

Precision is important when we want to avoid wrong answers. For example, a medical advice agent should only give answers it is sure about to avoid harm.

Recall is important when missing a correct answer is costly. For example, a customer support agent should try to answer all user questions, even if some answers are less certain.

Improving precision may lower recall and vice versa. The F1 score helps balance these two.
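A common way this tradeoff shows up in practice is a confidence threshold: the agent only answers when its confidence exceeds the threshold. The sketch below sweeps a threshold over a small, made-up set of (confidence, was-correct) predictions to show precision rising and recall falling as the threshold increases; all data here is illustrative.

```python
# Each entry: (agent confidence, whether the answer was actually correct).
predictions = [(0.95, True), (0.9, True), (0.8, False), (0.7, True),
               (0.6, True), (0.5, False), (0.4, True), (0.3, False)]

def metrics_at(threshold: float) -> tuple[float, float]:
    """Precision and recall if the agent only answers at or above the threshold."""
    answered = [(c, ok) for c, ok in predictions if c >= threshold]
    tp = sum(ok for _, ok in answered)
    fp = len(answered) - tp
    fn = sum(ok for _, ok in predictions) - tp  # correct answers withheld
    p = tp / (tp + fp) if answered else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

for t in (0.9, 0.5, 0.0):
    p, r = metrics_at(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

A cautious threshold (0.9) gives perfect precision but misses most correct answers; answering everything (0.0) gives full recall but many wrong answers. The F1-maximizing threshold sits somewhere in between.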

What good vs bad metric values look like for agent accuracy and relevance
  • Good: Precision and recall above 0.8 means the agent is mostly correct and finds most relevant answers.
  • Bad: Precision below 0.5 means many wrong answers. Recall below 0.5 means many correct answers are missed.
  • High precision but low recall means the agent is cautious but misses many opportunities to help.
  • High recall but low precision means the agent gives many answers but many are wrong or irrelevant.
Common pitfalls when measuring agent accuracy and relevance
  • Accuracy paradox: If the data is heavily imbalanced (e.g., most cases need no answer), accuracy can be high even if the agent never answers.
  • Data leakage: Testing the agent on data it has seen before inflates metrics falsely.
  • Overfitting: Agent performs well on training data but poorly on new questions.
  • Ignoring relevance: Measuring only correctness without checking if answers fit the user's intent.
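The accuracy paradox from the first pitfall is easy to demonstrate with a few lines of arithmetic. The numbers below are made up for illustration:

```python
# 1000 test cases; only 50 actually need an answer (heavy class imbalance).
total, needs_answer = 1000, 50

# A degenerate agent that never answers is "correct" on every
# no-answer case and wrong on all 50 cases that needed help.
correct = total - needs_answer
accuracy = correct / total     # 0.95 -- looks great on paper
tp = 0                         # it never gave a single correct answer
recall = tp / needs_answer     # 0.0 -- useless for the users who need help
print(accuracy, recall)
```

This is why accuracy alone is a misleading headline number for agents: recall (and precision) on the cases that matter must be reported alongside it.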
Self-check question

Your agent has 98% accuracy but only 12% recall on important user questions. Is it good for production? Why or why not?

Answer: No, it is not good. The agent misses most important questions (low recall), so it fails to help users even if its few answers are mostly correct (high accuracy). Improving recall is critical.

Key Result
Precision, recall, and F1 score best measure agent accuracy and relevance by balancing correctness and coverage.