Agentic AI · ML · ~20 mins

Measuring agent accuracy and relevance in Agentic AI - ML Experiment: Train & Evaluate

Experiment - Measuring agent accuracy and relevance
Problem: You have built an AI agent that answers questions. The agent produces answers, but you want to check how accurate and relevant they are compared to the correct answers.
Current Metrics: Accuracy: 65%; Relevance score (based on human rating): 70%
Issue: The agent's accuracy and relevance are low, meaning it often gives wrong or unhelpful answers.
Your Task
Improve the agent's accuracy to at least 80% and relevance score to at least 85%.
You can only adjust the evaluation method and agent's response filtering.
You cannot change the agent's core model or training data.
Solution
from sklearn.metrics import f1_score

# Sample true and predicted answers (1 = correct, 0 = incorrect)
true_answers = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted_answers = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# F1-score balances precision and recall, making it a stricter
# measure of correctness than simple accuracy
accuracy_f1 = f1_score(true_answers, predicted_answers)

# Sample confidence scores for each predicted answer
confidence_scores = [0.9, 0.6, 0.4, 0.95, 0.5, 0.85, 0.3, 0.7, 0.9, 0.6]

# Suppress low-confidence answers: keep a prediction only if its
# confidence is at least 0.7, otherwise treat it as "no answer" (0)
filtered_predictions = [
    pred if conf >= 0.7 else 0
    for pred, conf in zip(predicted_answers, confidence_scores)
]

# Recompute the F1-score after filtering
filtered_accuracy_f1 = f1_score(true_answers, filtered_predictions)

# Relevance: the percentage of retained (non-zero) answers that
# match the true answers
correct_filtered = sum(
    1 for t, p in zip(true_answers, filtered_predictions) if t == p and p != 0
)
retained = sum(1 for p in filtered_predictions if p != 0)
relevance_score = correct_filtered / retained * 100 if retained > 0 else 0

print(f"Original F1 Accuracy: {accuracy_f1:.2f}")
print(f"Filtered F1 Accuracy: {filtered_accuracy_f1:.2f}")
print(f"Relevance Score after filtering: {relevance_score:.2f}%")
Used the F1-score instead of simple accuracy to better measure correctness.
Added confidence-score filtering to remove low-confidence answers.
Calculated relevance as the percentage of retained answers that are correct.
Results Interpretation

Before filtering: Accuracy (F1) was 0.80 (80%), Relevance was 70%.

After filtering low-confidence answers: Accuracy (F1) improved to 0.89 (89%), Relevance improved to 100%.

Filtering answers by confidence removes uncertain responses, improving both accuracy and relevance. The F1-score balances precision and recall, making it a more informative measure of correctness than simple accuracy.
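As a sanity check, the original F1 figure can be reproduced by hand from the same sample lists used in the solution code, which makes the precision/recall balance concrete (a sketch; the variable names mirror the solution above):

```python
# Hand-computed precision, recall, and F1 for the same sample data,
# to show what the sklearn f1_score call is doing under the hood
true_answers = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted_answers = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Count true positives, false positives, and false negatives
tp = sum(1 for t, p in zip(true_answers, predicted_answers) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(true_answers, predicted_answers) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(true_answers, predicted_answers) if t == 1 and p == 0)

precision = tp / (tp + fp)  # 4 / 5 = 0.80
recall = tp / (tp + fn)     # 4 / 5 = 0.80
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
# Precision: 0.80, Recall: 0.80, F1: 0.80
```

Here precision and recall happen to be equal, so F1 equals both; when one of them drops (e.g. after filtering removes a false positive), F1 shifts toward the lower of the two.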
Bonus Experiment
Try using the BLEU score to measure the quality of the agent's text answers instead of binary correctness.
💡 Hint
BLEU compares the agent's answer text to reference answers by matching words and phrases, giving a score from 0 to 1.
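To make the hint concrete, here is a heavily simplified BLEU-style scorer in plain Python: the `simple_bleu` function is a hypothetical helper written for this sketch (it computes clipped n-gram precisions with a brevity penalty); for real evaluation, use an established implementation such as `nltk.translate.bleu_score.sentence_bleu`.

```python
import math
from collections import Counter

def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU sketch: geometric mean of clipped n-gram
    precisions (up to max_n) times a brevity penalty. Not the full
    BLEU spec -- a teaching approximation only."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

print(f"{simple_bleu('the cat sat on the mat', 'the cat sat on the mat'):.2f}")
# 1.00 for an exact match; partial overlaps score between 0 and 1
```

An exact match scores 1.0, a completely unrelated answer scores 0.0, and near-matches land in between, which is what makes BLEU usable as a graded relevance signal for free-text answers.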