RAG evaluation metrics in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems

🧠 Conceptual

intermediate

Understanding RAG's Retrieval and Generation Metrics

In Retrieval-Augmented Generation (RAG) models, which metric best measures how well the retrieval component finds relevant documents?

A. BLEU score measuring text similarity between generated and reference answers

B. Perplexity measuring the language model's uncertainty in generating text

C. Recall@k measuring the fraction of relevant documents retrieved in the top k results

D. F1 score measuring token overlap between generated and reference answers
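
For reference, the "fraction of relevant documents retrieved in the top k results" wording can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function name `recall_at_k`, the document IDs, and the choice to return 0.0 when there are no relevant documents are all assumptions made for the example.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        # Assumed convention for this sketch; other tools may raise or skip the query.
        return 0.0
    top_k = set(retrieved[:k])  # only the first k ranked results count
    return len(top_k & relevant) / len(relevant)


# Hypothetical document IDs, for illustration only.
relevant = {1, 2, 3, 4}
retrieved = [2, 9, 4, 7, 1]

print(recall_at_k(retrieved, relevant, k=3))  # 2 of 4 relevant docs in the top 3 -> 0.5
print(recall_at_k(retrieved, relevant, k=5))  # 3 of 4 relevant docs in the top 5 -> 0.75
```

Note that the score can only stay the same or grow as k increases, because a larger k never removes a document from the top-k set.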

❓ Metrics

intermediate

Evaluating Generated Answers in RAG

Which metric is most appropriate to evaluate the quality of the generated answers in a RAG system compared to reference answers?

A. ROUGE-L measuring longest common subsequence overlap

B. Top-1 accuracy of retrieved documents

C. Mean Squared Error between embeddings

D. Recall@k for retrieval accuracy
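
To make "longest common subsequence overlap" concrete, here is a rough sketch of a ROUGE-L style score. It is a simplified variant under stated assumptions: whitespace tokenization, lowercasing, and a plain F1 of LCS-based precision and recall. Published ROUGE implementations differ in tokenization, stemming, and the weighting of precision versus recall, and the names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall (assumed variant)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the common subsequence
    recall = lcs / len(ref)      # share of reference tokens on the common subsequence
    return 2 * precision * recall / (precision + recall)


# Toy sentences, for illustration only.
score = rouge_l_f1("the cat is on the mat", "the cat sat on the mat")
print(round(score, 2))  # LCS is 5 of 6 tokens on both sides -> 0.83
```

Unlike Recall@k, this score looks only at the generated text and a reference answer; it says nothing about which documents were retrieved.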

❓ Predict Output

advanced

Output of Recall@k Calculation Code

What is the output of this Python code that calculates Recall@3 for a RAG retrieval step?

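# IDs of the documents that are actually relevant to the query
relevant_docs = {101, 102, 103}
# Ranked retrieval results, best match first
retrieved_docs = [104, 102, 105, 101, 106]
recall_at_3 = len(set(retrieved_docs[:3]) & relevant_docs) / len(relevant_docs)
print(round(recall_at_3, 2))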

A. 1.0

B. 0.67

C. 0.0

D. 0.33

❓ Model Choice

advanced

Choosing a Metric for End-to-End RAG Evaluation

You want to evaluate a RAG model end-to-end, considering both retrieval and generation quality. Which combined metric approach is best?

A. Use Recall@k for retrieval and ROUGE-L for generation, then average the two scores

B. Use only BLEU score on generated answers, ignoring retrieval

C. Use perplexity of the generation model only

D. Use accuracy of retrieved documents only

🔧 Debug

expert

Debugging Incorrect Recall@k Calculation

Given this code snippet for Recall@5, what is the bug causing incorrect recall calculation?

relevant = [10, 20, 30]
retrieved = [20, 10, 40, 50, 60]
recall = len(set(retrieved[:5]) & set(relevant)) / len(retrieved[:5])
print(round(recall, 2))

A. Recall calculation is correct, no bug

B. Recall denominator should be number of relevant documents, not retrieved documents

C. Recall numerator should count all retrieved documents, not intersection

D. Recall should use union of sets, not intersection
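
When reasoning about denominators in retrieval metrics, it can help to see Precision@k and Recall@k side by side. The sketch below uses the standard textbook definitions; the function names, the toy document IDs, and the assumption that the retrieved list contains no duplicates are illustrative choices, not part of the exercise.

```python
from typing import Hashable, Sequence


def precision_at_k(retrieved: Sequence[Hashable], relevant: Sequence[Hashable], k: int) -> float:
    """Share of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = len(set(top_k) & set(relevant))
    return hits / len(top_k)  # denominator: number of documents retrieved (assumes no duplicates)


def recall_at_k(retrieved: Sequence[Hashable], relevant: Sequence[Hashable], k: int) -> float:
    """Share of the relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(set(relevant))  # denominator: number of relevant documents


# Hypothetical IDs, chosen so the two denominators differ.
relevant = ["a", "b", "c", "d"]
retrieved = ["b", "x", "a", "y", "z"]

print(precision_at_k(retrieved, relevant, k=5))  # 2 hits / 5 retrieved -> 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 2 hits / 4 relevant  -> 0.5
```

Both functions share the same numerator (the size of the intersection), so the only thing that distinguishes them is what the hit count is divided by.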

Practice

(1/5)

1. What do RAG evaluation metrics primarily measure in a retrieval-augmented generation system?

easy

A. Both the quality of generated answers and the relevance of retrieved documents

B. Only the speed of document retrieval

C. The size of the training dataset

D. The number of layers in the neural network

Solution

Step 1: Understand RAG system components

Step 2: Identify what metrics measure

Final Answer:

Quick Check:

Solution

Step 1: Identify retrieval metrics

Step 2: Match metric to retrieval

Final Answer:

Quick Check:

Solution

Step 1: Verify f1_score handles strings

Step 2: Compute macro F1

Final Answer:

Quick Check:

Solution

Step 1: Understand precision formula

Step 2: Identify denominator mistake

Step 3: Fix denominator

Final Answer:

Quick Check:

Solution

Step 1: Understand metric combination needs

Step 2: Evaluate combination methods

Step 3: Choose harmonic mean

Final Answer:

Quick Check:
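
The metric-combination steps above mention the harmonic mean. As a rough illustration of why one might prefer it over a plain average, the sketch below compares the two on a pair of hypothetical scores. The function names and the example values are assumptions made for this sketch, and it presumes both scores lie in [0, 1].

```python
def arithmetic_mean(retrieval_score: float, generation_score: float) -> float:
    """Plain average: a strong score can mask a weak one."""
    return (retrieval_score + generation_score) / 2


def harmonic_mean(retrieval_score: float, generation_score: float) -> float:
    """Harmonic mean: stays low unless both scores are reasonably high."""
    total = retrieval_score + generation_score
    if total == 0:
        return 0.0  # avoid division by zero when both scores are 0
    return 2 * retrieval_score * generation_score / total


# Hypothetical scores: strong retrieval (e.g. Recall@k), weak generation (e.g. ROUGE-L).
retrieval, generation = 0.9, 0.1

print(round(arithmetic_mean(retrieval, generation), 2))  # 0.5
print(round(harmonic_mean(retrieval, generation), 2))    # 0.18
```

With balanced scores the two means agree (for example, 0.6 and 0.6 give 0.6 either way); they diverge only when one component lags behind the other.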