
RAG evaluation metrics in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Understanding RAG's Retrieval and Generation Metrics

In Retrieval-Augmented Generation (RAG) models, which metric best measures how well the retrieval component finds relevant documents?

A. BLEU score measuring text similarity between generated and reference answers
B. Perplexity measuring the language model's uncertainty in generating text
C. Recall@k measuring the fraction of relevant documents retrieved in the top k results
D. F1 score measuring token overlap between generated and reference answers
💡 Hint

Think about which metric evaluates the retrieval step's ability to find relevant documents.
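As background, Recall@k has a direct definition: the fraction of all relevant documents that appear in the top k retrieved results. A minimal sketch (the document IDs below are made up for illustration):

```python
def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of relevant documents found in the top-k retrieved results."""
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)

# Hypothetical document IDs for illustration only
relevant = {1, 2, 3, 4}
retrieved = [2, 9, 4, 7, 1]
print(recall_at_k(relevant, retrieved, 5))  # 3 of 4 relevant docs in top 5 -> 0.75
```

Note that the denominator is the number of *relevant* documents, which is what distinguishes recall from precision.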

Metrics
intermediate
Evaluating Generated Answers in RAG

Which metric is most appropriate to evaluate the quality of the generated answers in a RAG system compared to reference answers?

A. ROUGE-L measuring longest common subsequence overlap
B. Top-1 accuracy of retrieved documents
C. Mean Squared Error between embeddings
D. Recall@k for retrieval accuracy
💡 Hint

Focus on metrics that compare generated text to reference text.
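For reference, ROUGE-L scores a generated answer against a reference by the length of their longest common subsequence (LCS) of tokens. A simplified word-level sketch (real ROUGE implementations add stemming and other normalization, which this omits):

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest-common-subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace-split tokens (simplified)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"), 2))  # -> 0.83
```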

Predict Output
advanced
Output of Recall@k Calculation Code

What is the output of this Python code that calculates Recall@3 for a RAG retrieval step?

relevant_docs = {101, 102, 103}
retrieved_docs = [104, 102, 105, 101, 106]
recall_at_3 = len(set(retrieved_docs[:3]) & relevant_docs) / len(relevant_docs)
print(round(recall_at_3, 2))
A. 1.0
B. 0.67
C. 0.0
D. 0.33
💡 Hint

Check which relevant documents appear in the first 3 retrieved docs.

Model Choice
advanced
Choosing a Metric for End-to-End RAG Evaluation

You want to evaluate a RAG model end-to-end, considering both retrieval and generation quality. Which combined metric approach is best?

A. Use Recall@k for retrieval and ROUGE-L for generation, then average the two scores
B. Use only BLEU score on generated answers, ignoring retrieval
C. Use perplexity of the generation model only
D. Use accuracy of retrieved documents only
💡 Hint

Think about combining retrieval and generation metrics for full evaluation.
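One simple way to combine a retrieval metric with a generation metric is a weighted average. The sketch below assumes an equal 0.5/0.5 weighting purely for illustration; in practice the weights (or a different aggregation entirely) depend on which failure mode matters more for your application:

```python
def combined_rag_score(recall_k: float, rouge_l: float, retrieval_weight: float = 0.5) -> float:
    """Weighted average of a retrieval score and a generation score.

    The default 0.5/0.5 split is an illustrative choice, not a standard.
    """
    return retrieval_weight * recall_k + (1 - retrieval_weight) * rouge_l

print(round(combined_rag_score(0.67, 0.45), 2))  # -> 0.56
```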

🔧 Debug
expert
Debugging Incorrect Recall@k Calculation

Given this code snippet for Recall@5, what is the bug causing the incorrect recall calculation?

relevant = [10, 20, 30]
retrieved = [20, 10, 40, 50, 60]
recall = len(set(retrieved[:5]) & set(relevant)) / len(retrieved[:5])
print(round(recall, 2))
A. Recall calculation is correct, no bug
B. Recall denominator should be the number of relevant documents, not retrieved documents
C. Recall numerator should count all retrieved documents, not the intersection
D. Recall should use the union of sets, not the intersection
💡 Hint

Recall measures fraction of relevant items retrieved, so denominator matters.
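After you have made your attempt, compare against a version that follows the standard definition of recall, using the same IDs as the snippet above:

```python
relevant = [10, 20, 30]
retrieved = [20, 10, 40, 50, 60]

# Recall@5: divide by the number of RELEVANT documents,
# not by the number of retrieved documents
recall = len(set(retrieved[:5]) & set(relevant)) / len(relevant)
print(round(recall, 2))  # 2 of 3 relevant docs retrieved -> 0.67
```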