Prompt Engineering / GenAI | ML | ~20 mins

RAG evaluation metrics in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - RAG evaluation metrics

Problem: You have a Retrieval-Augmented Generation (RAG) model that combines retrieved documents with a generative model to answer questions. You want to evaluate how well the model answers questions using standard metrics.

Current Metrics: Exact Match (EM): 55%, F1 Score: 62%, Rouge-L: 58%

Issue: The evaluation metrics are moderate, but you want to improve the evaluation process by adding more comprehensive metrics and ensuring the code computes them correctly.

Your Task

Implement and compute multiple evaluation metrics (Exact Match, F1 Score, Rouge-L) for RAG model outputs on a question-answering dataset. Ensure metrics are accurate and interpretable.

Use Python with standard libraries and Hugging Face's datasets and evaluate packages.

Do not change the model or dataset, only focus on evaluation code.

Metrics must be computed correctly and runnable.

Solution

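import evaluate

# Sample predictions and references demonstrating exact and partial matches
predictions = ["Paris is the capital of France.", "Water boils at 100 degrees Celsius."]
references = ["Paris is the capital of France.", "Water boils at 100°C."]

# The squad metric expects dicts that share an id, not bare strings.
# answer_start is part of the expected input schema; it does not affect the scores.
squad_predictions = [
    {"id": str(i), "prediction_text": pred} for i, pred in enumerate(predictions)
]
squad_references = [
    {"id": str(i), "answers": {"text": [ref], "answer_start": [0]}}
    for i, ref in enumerate(references)
]

# Load squad metric for QA-specific Exact Match and F1 (token-level)
squad = evaluate.load("squad")
squad_results = squad.compute(predictions=squad_predictions, references=squad_references)

# Load Rouge (takes plain strings)
rouge = evaluate.load("rouge")
rouge_results = rouge.compute(predictions=predictions, references=references, rouge_types=["rougeL"])

# Extract scores: squad reports values on a 0-100 scale, rouge returns a float in 0-1
em_score = squad_results["exact_match"]
f1_score = squad_results["f1"]
rouge_l_score = rouge_results["rougeL"]

print(f"Exact Match: {em_score:.2f}%")
print(f"F1 Score: {f1_score:.2f}%")
print(f"Rouge-L: {rouge_l_score * 100:.2f}%")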

Replaced separate metric loads with 'squad' from evaluate library for accurate QA Exact Match (normalized string match) and F1 Score (token overlap F1).

Adjusted first prediction to exactly match reference for demonstrating 50% EM.

Used Rouge-L as before.

Multiplied the Rouge-L score by 100 for percentage display, since rouge returns values in the 0-1 range while the squad metric already reports EM and F1 on a 0-100 scale.

Ensured code is fully runnable without errors.
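To make "normalized string match" and "token overlap F1" concrete, here is a minimal pure-Python sketch of both metrics on the same two sample pairs. It assumes SQuAD-style normalization (lowercase, strip ASCII punctuation, drop English articles, collapse whitespace); the function names are hypothetical helpers for illustration, not part of the evaluate library.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip ASCII punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


predictions = ["Paris is the capital of France.", "Water boils at 100 degrees Celsius."]
references = ["Paris is the capital of France.", "Water boils at 100°C."]

pairs = list(zip(predictions, references))
em = 100 * sum(exact_match(p, r) for p, r in pairs) / len(pairs)
f1 = 100 * sum(token_f1(p, r) for p, r in pairs) / len(pairs)

print(f"Exact Match: {em:.2f}%")  # 50.00%
print(f"F1 Score: {f1:.2f}%")     # 80.00%
```

The second pair shares three tokens ("water", "boils", "at"), giving precision 3/6 and recall 3/4, so its F1 is 0.6 even though its Exact Match is 0.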

Results Interpretation

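Before: EM: 55%, F1: 62%, Rouge-L: 58%

After (on the two-example sample above): EM: 50%, F1: 80%, Rouge-L: about 86%

Using the 'squad' metric provides standard QA evaluation: Exact Match compares normalized strings, and token-level F1 gives partial credit for overlapping tokens. On the sample, the first prediction matches its reference exactly and the second only overlaps with it, so EM is 50% while F1 and Rouge-L are higher. The "After" figures describe the two-sentence sample, not the original dataset, so they show that the metric code behaves as expected rather than that the model improved.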
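Rouge-L gives partial credit through the longest common subsequence (LCS) of tokens. The sketch below computes a Rouge-L F-measure by hand for the sample pairs. It assumes a simple tokenizer (lowercase, split on non-alphanumeric characters, no stemming); the helper names are hypothetical, and a library implementation may tokenize or aggregate differently.

```python
import re


def tokenize(text):
    """Lowercase and keep runs of ASCII letters and digits (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(prediction, reference):
    """F-measure of LCS-based precision and recall."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


predictions = ["Paris is the capital of France.", "Water boils at 100 degrees Celsius."]
references = ["Paris is the capital of France.", "Water boils at 100°C."]

scores = [rouge_l_f(p, r) for p, r in zip(predictions, references)]
mean_rouge_l = sum(scores) / len(scores)

for score in scores:
    print(f"Rouge-L F-measure: {score:.4f}")
print(f"Mean Rouge-L: {mean_rouge_l * 100:.2f}%")
```

Under this tokenizer the second reference becomes five tokens ("water", "boils", "at", "100", "c"), and the LCS with the six-token prediction has length 4, which is why the pair scores well above zero despite failing Exact Match.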

Bonus Experiment

Try adding BLEU and METEOR metrics to evaluate the RAG model outputs and compare results.

💡 Hint

Use the evaluate library to load 'bleu' and 'meteor' metrics and compute them similarly to the other metrics.
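One possible starting point for the bonus is sketched below. It assumes the 'bleu' and 'meteor' metrics can be loaded through evaluate.load, that the machine has network access, and that METEOR's NLTK dependencies are installed. The helper names to_bleu_references and compute_bleu_meteor are hypothetical and exist only for this illustration.

```python
def to_bleu_references(references):
    """Wrap each reference in a list, since BLEU allows several references per prediction."""
    return [[ref] for ref in references]


def compute_bleu_meteor(predictions, references):
    """Return BLEU and METEOR (both in the 0-1 range) for parallel lists of strings."""
    # Imported lazily so the helper above can be used without the package installed
    import evaluate

    bleu = evaluate.load("bleu")
    meteor = evaluate.load("meteor")

    bleu_result = bleu.compute(
        predictions=predictions,
        references=to_bleu_references(references),
    )
    meteor_result = meteor.compute(predictions=predictions, references=references)

    return {"bleu": bleu_result["bleu"], "meteor": meteor_result["meteor"]}


predictions = ["Paris is the capital of France.", "Water boils at 100 degrees Celsius."]
references = ["Paris is the capital of France.", "Water boils at 100°C."]
```

Calling compute_bleu_meteor(predictions, references) returns a dictionary whose values can be multiplied by 100 and printed next to EM, F1, and Rouge-L. BLEU is built on n-gram precision and tends to be harsh on very short texts, so expect it to diverge from the other metrics on a two-sentence sample.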

Practice

(1/5)

1. What do RAG evaluation metrics primarily measure in a retrieval-augmented generation system?

easy

A. Both the quality of generated answers and the relevance of retrieved documents

B. Only the speed of document retrieval

C. The size of the training dataset

D. The number of layers in the neural network

Solution

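Step 1: Understand RAG system components

A RAG system has two parts: a retriever that fetches documents and a generator that writes the answer from them.

Step 2: Identify what metrics measure

Evaluation has to cover both parts: whether the retrieved documents are relevant and whether the generated answer is correct.

Final Answer:

Both the quality of generated answers and the relevance of retrieved documents (Option A)

Quick Check:

RAG = retrieval + generation, so its metrics cover both.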

Solution

Step 1: Identify retrieval metrics

Step 2: Match metric to retrieval

Final Answer:

Quick Check:

Solution

Step 1: Verify f1_score handles strings

Step 2: Compute macro F1

Final Answer:

Quick Check:

Solution

Step 1: Understand precision formula

Step 2: Identify denominator mistake

Step 3: Fix denominator

Final Answer:

Quick Check:

Solution

Step 1: Understand metric combination needs

Step 2: Evaluate combination methods

Step 3: Choose harmonic mean

Final Answer:

Quick Check: