Measuring agent accuracy and relevance in Agentic AI - ML Experiment: Train & Evaluate

Experiment - Measuring agent accuracy and relevance
Problem: You have built an AI agent that answers questions. You want to check how accurate and relevant its answers are compared to the correct answers.
Current Metrics: Accuracy: 65%, Relevance score (based on human rating): 70%
Issue: The agent's accuracy and relevance are low, meaning it often gives wrong or unhelpful answers.
Your Task
Improve the agent's accuracy to at least 80% and relevance score to at least 85%.
You can only adjust the evaluation method and agent's response filtering.
You cannot change the agent's core model or training data.
Solution
from sklearn.metrics import f1_score

# Sample true and predicted answers (1 for correct, 0 for incorrect)
true_answers = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted_answers = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Calculate F1-score (harmonic mean of precision and recall) on the unfiltered predictions
accuracy_f1 = f1_score(true_answers, predicted_answers)

# Sample confidence scores for each predicted answer
confidence_scores = [0.9, 0.6, 0.4, 0.95, 0.5, 0.85, 0.3, 0.7, 0.9, 0.6]

# Keep a prediction only if its confidence is >= 0.7; otherwise treat it as 0 (withheld)
filtered_predictions = [pred if conf >= 0.7 else 0 for pred, conf in zip(predicted_answers, confidence_scores)]

# Calculate new F1-score after filtering
filtered_accuracy_f1 = f1_score(true_answers, filtered_predictions)

# Calculate relevance as the percentage of answers kept after filtering (non-zero) that match the true answers
kept_answers = sum(1 for p in filtered_predictions if p != 0)
correct_filtered = sum(1 for t, p in zip(true_answers, filtered_predictions) if t == p and p != 0)
relevance_score = correct_filtered / kept_answers * 100 if kept_answers > 0 else 0

print(f"Original F1 Accuracy: {accuracy_f1:.2f}")
print(f"Filtered F1 Accuracy: {filtered_accuracy_f1:.2f}")
print(f"Relevance Score after filtering: {relevance_score:.2f}%")
Used F1-score instead of simple accuracy to better measure correctness.
Added confidence score filtering to remove low-confidence answers.
Calculated relevance as the percentage of answers kept after filtering that are correct.
Results Interpretation

Before filtering: Accuracy (F1) was 0.80 (80%), Relevance was 70%.

After filtering low-confidence answers: Accuracy (F1) improved to 0.89 (89%), Relevance improved to 100%.

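Filtering answers by confidence removes uncertain responses. In this sample it drops the one false positive, so precision rises from 0.80 to 1.00 while recall stays at 0.80, which lifts F1 from 0.80 to 0.89. In general, a stricter threshold can also discard correct answers and lower recall, so the threshold should be checked against both. F1-score balances precision and recall, which simple accuracy does not distinguish.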
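To see where the F1 change comes from, the following is a minimal sketch that computes precision, recall, and F1 by hand on the same sample data as the solution. The helper name precision_recall_f1 is hypothetical and not part of the original solution; it assumes binary labels with 1 as the positive class.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Same sample data as the solution above
true_answers = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted_answers = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
confidence_scores = [0.9, 0.6, 0.4, 0.95, 0.5, 0.85, 0.3, 0.7, 0.9, 0.6]

# Low-confidence predictions are replaced with 0, as in the solution
filtered_predictions = [
    p if c >= 0.7 else 0 for p, c in zip(predicted_answers, confidence_scores)
]

before = precision_recall_f1(true_answers, predicted_answers)
after = precision_recall_f1(true_answers, filtered_predictions)

print("Before filtering: precision=%.2f recall=%.2f f1=%.2f" % before)
print("After filtering:  precision=%.2f recall=%.2f f1=%.2f" % after)
```

Reporting precision and recall next to F1 makes it visible whether a filter is removing wrong answers, correct answers, or both.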
Bonus Experiment
Try using the BLEU score to measure the quality of the agent's text answers instead of binary correctness.
💡 Hint
BLEU compares the agent's answer text to reference answers by matching words and phrases, giving a score from 0 to 1.
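As a starting point for the bonus experiment, here is a simplified BLEU-style score in plain Python. It is only a sketch: it uses whitespace tokenization, a single reference, n-grams up to length 2, and no smoothing, so it returns 0 whenever any n-gram order has no overlap. The function name simple_bleu and the example sentences are assumptions made for illustration; a maintained library implementation would normally be preferred for real evaluation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simple_bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU: clipped n-gram precision plus brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))

    # Penalize candidates that are shorter than the reference
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the capital of france is paris"
identical_score = simple_bleu("the capital of france is paris", reference)
partial_score = simple_bleu("paris is the capital", reference)
unrelated_score = simple_bleu("bananas are yellow", reference)

print(f"identical: {identical_score:.2f}")
print(f"partial:   {partial_score:.2f}")
print(f"unrelated: {unrelated_score:.2f}")
```

A score like this rewards word overlap rather than factual correctness, so it is best read alongside accuracy, not as a replacement for it.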

Practice

1. What does accuracy measure when evaluating an AI agent's answers?
easy
A. How many answers are related but not exact
B. How fast the agent responds
C. How many answers are exactly correct
D. How many answers are generated

Solution

  1. Step 1: Understand accuracy definition

    Accuracy counts the number of answers that match the correct ones exactly.
  2. Step 2: Compare with other metrics

    Relevance measures usefulness, not exact correctness, so it is different from accuracy.
  3. Final Answer:

    How many answers are exactly correct -> Option C
  4. Quick Check:

    Accuracy = exact correctness [OK]
Hint: Accuracy means exact right answers only [OK]
Common Mistakes:
  • Confusing accuracy with relevance
  • Thinking accuracy measures speed
  • Assuming accuracy counts all related answers
2. Which of the following is the correct way to calculate accuracy for an AI agent's answers?
easy
A. Number of related answers divided by total answers
B. Number of correct answers divided by total answers
C. Number of answers generated per second
D. Number of answers ignored by the agent

Solution

  1. Step 1: Recall accuracy formula

    Accuracy = (correct answers) / (total answers given).
  2. Step 2: Eliminate incorrect options

    Options about related answers or speed do not define accuracy.
  3. Final Answer:

    Number of correct answers divided by total answers -> Option B
  4. Quick Check:

    Accuracy = correct / total [OK]
Hint: Accuracy = correct answers ÷ total answers [OK]
Common Mistakes:
  • Using related answers count instead of correct
  • Mixing speed with accuracy
  • Ignoring total number of answers
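The formula above can be sketched for free-text answers as follows. This is an illustrative example, not the lesson's own code: the helper names normalize and exact_match_accuracy and the sample answers are hypothetical, and "exactly correct" is assumed to mean equal after lowercasing and collapsing whitespace.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0  # nothing to evaluate
    correct = sum(
        1 for p, r in zip(predictions, references) if normalize(p) == normalize(r)
    )
    return correct / len(references)


# Hypothetical agent answers and their reference answers
predictions = ["Paris", "  berlin ", "Rome", "Lisbon"]
references = ["paris", "Berlin", "Madrid", "Lisbon"]

accuracy = exact_match_accuracy(predictions, references)
print(f"Accuracy: {accuracy:.2%}")  # 3 correct out of 4 -> 75.00%
```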
3. An AI agent answered 80 questions: 60 answers were exactly correct, and 10 more were relevant but not exact. What are the accuracy and relevance percentages?
medium
A. Accuracy 60%, Relevance 70%
B. Accuracy 60%, Relevance 87.5%
C. Accuracy 75%, Relevance 60%
D. Accuracy 75%, Relevance 87.5%

Solution

  1. Step 1: Calculate accuracy percentage

    Accuracy = (60 correct / 80 total) * 100 = 75%.
  2. Step 2: Calculate relevance percentage

    Relevance = ((60 correct + 10 relevant) / 80 total) * 100 = 87.5%.
  3. Final Answer:

    Accuracy 75%, Relevance 87.5% -> Option D
  4. Quick Check:

    Accuracy = 75%, Relevance = 87.5% [OK]
Hint: Add relevant to correct for relevance % [OK]
Common Mistakes:
  • Mixing accuracy and relevance values
  • Not adding relevant answers for relevance
  • Dividing by wrong total number
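The arithmetic in this question can be reproduced from per-answer labels. The sketch below assumes each answer has been rated as "exact", "relevant", or "irrelevant"; those label names are made up for illustration and follow the question's rule that exact answers also count as relevant.

```python
from collections import Counter

# Hypothetical ratings matching the question: 60 exact, 10 relevant-only, 10 irrelevant
labels = ["exact"] * 60 + ["relevant"] * 10 + ["irrelevant"] * 10

counts = Counter(labels)
total = len(labels)

# Accuracy counts only exactly correct answers
accuracy_pct = counts["exact"] / total * 100

# Relevance counts exact answers plus relevant-but-not-exact answers
relevance_pct = (counts["exact"] + counts["relevant"]) / total * 100

print(f"Accuracy:  {accuracy_pct}%")   # 75.0%
print(f"Relevance: {relevance_pct}%")  # 87.5%
```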
4. An AI agent evaluation code snippet is below. It is meant to calculate accuracy but fails when run. What is the bug?
correct = 50
total = 0
accuracy = correct / total
print(accuracy)
medium
A. Division by zero error due to total being zero
B. Correct variable is zero, so accuracy is zero
C. Print statement syntax is wrong
D. Accuracy should be multiplied by 100

Solution

  1. Step 1: Identify variables and operation

    correct = 50, total = 0, accuracy = correct / total.
  2. Step 2: Check for division errors

    Dividing by zero (total=0) raises a ZeroDivisionError in Python, so no accuracy value is produced.
  3. Final Answer:

    Division by zero error due to total being zero -> Option A
  4. Quick Check:

    Division by zero causes error [OK]
Hint: Check denominator is not zero before dividing [OK]
Common Mistakes:
  • Ignoring zero division error
  • Thinking print syntax is wrong
  • Assuming accuracy must be multiplied by 100
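One way to avoid this failure is to check the denominator before dividing. The sketch below is one possible guard, not the only reasonable design: the helper name safe_accuracy is hypothetical, and returning None for an empty evaluation set is an assumption (raising a descriptive error would be an equally valid choice).

```python
def safe_accuracy(correct, total):
    """Return correct / total, or None when there is nothing to evaluate."""
    if total == 0:
        return None  # accuracy is undefined with zero answers
    return correct / total


# The original snippet fails because total is 0
try:
    50 / 0
    raised = False
except ZeroDivisionError:
    raised = True

empty_result = safe_accuracy(50, 0)
normal_result = safe_accuracy(50, 100)

print(f"Unguarded division raised an error: {raised}")
print(f"Guarded result with total=0: {empty_result}")
print(f"Guarded result with total=100: {normal_result}")
```

Returning None instead of 0 keeps "no data" distinguishable from "every answer was wrong".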
5. You want to improve trust in an AI agent by measuring both accuracy and relevance. Which approach best helps achieve this?
hard
A. Track exact correct answers and also count useful related answers
B. Only count answers that are exactly correct
C. Ignore relevance and focus on speed of answers
D. Count all answers regardless of correctness or relevance

Solution

  1. Step 1: Understand trust factors

    Trust improves when answers are both correct and useful (relevant).
  2. Step 2: Choose measurement approach

    Tracking both exact correctness (accuracy) and usefulness (relevance) gives a fuller picture.
  3. Final Answer:

    Track exact correct answers and also count useful related answers -> Option A
  4. Quick Check:

    Measure accuracy + relevance for trust [OK]
Hint: Measure both exact and useful answers for trust [OK]
Common Mistakes:
  • Focusing only on exact correctness
  • Ignoring relevance completely
  • Measuring speed instead of quality