RAG evaluation metrics in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Understanding RAG's Retrieval and Generation Metrics

In Retrieval-Augmented Generation (RAG) models, which metric best measures how well the retrieval component finds relevant documents?

A. BLEU score measuring text similarity between generated and reference answers
B. Perplexity measuring the language model's uncertainty in generating text
C. Recall@k measuring the fraction of relevant documents retrieved in the top k results
D. F1 score measuring token overlap between generated and reference answers
💡 Hint

Think about which metric evaluates the retrieval step's ability to find relevant documents.

Metrics
intermediate
Evaluating Generated Answers in RAG

Which metric is most appropriate to evaluate the quality of the generated answers in a RAG system compared to reference answers?

A. ROUGE-L measuring longest common subsequence overlap
B. Top-1 accuracy of retrieved documents
C. Mean Squared Error between embeddings
D. Recall@k for retrieval accuracy
💡 Hint

Focus on metrics that compare generated text to reference text.

Predict Output
advanced
Output of Recall@k Calculation Code

What is the output of this Python code that calculates Recall@3 for a RAG retrieval step?

relevant_docs = {101, 102, 103}
retrieved_docs = [104, 102, 105, 101, 106]
recall_at_3 = len(set(retrieved_docs[:3]) & relevant_docs) / len(relevant_docs)
print(round(recall_at_3, 2))
A. 1.0
B. 0.67
C. 0.0
D. 0.33
💡 Hint

Check which relevant documents appear in the first 3 retrieved docs.

Model Choice
advanced
Choosing a Metric for End-to-End RAG Evaluation

You want to evaluate a RAG model end-to-end, considering both retrieval and generation quality. Which combined metric approach is best?

A. Use Recall@k for retrieval and ROUGE-L for generation, then average the two scores
B. Use only BLEU score on generated answers, ignoring retrieval
C. Use perplexity of the generation model only
D. Use accuracy of retrieved documents only
💡 Hint

Think about combining retrieval and generation metrics for full evaluation.

🔧 Debug
expert
Debugging Incorrect Recall@k Calculation

Given this code snippet for Recall@5, what is the bug causing incorrect recall calculation?

relevant = [10, 20, 30]
retrieved = [20, 10, 40, 50, 60]
recall = len(set(retrieved[:5]) & set(relevant)) / len(retrieved[:5])
print(round(recall, 2))
A. Recall calculation is correct, no bug
B. Recall denominator should be number of relevant documents, not retrieved documents
C. Recall numerator should count all retrieved documents, not intersection
D. Recall should use union of sets, not intersection
💡 Hint

Recall measures fraction of relevant items retrieved, so denominator matters.

Practice

1. What do RAG evaluation metrics primarily measure in a retrieval-augmented generation system?
easy
A. Both the quality of generated answers and the relevance of retrieved documents
B. Only the speed of document retrieval
C. The size of the training dataset
D. The number of layers in the neural network

Solution

  1. Step 1: Understand RAG system components

    RAG combines document retrieval and answer generation, so evaluation must cover both parts.
  2. Step 2: Identify what metrics measure

    Metrics check answer quality (like accuracy) and retrieval quality (like precision).
  3. Final Answer:

    Both the quality of generated answers and the relevance of retrieved documents -> Option A
  4. Quick Check:

    RAG metrics = answer + retrieval quality [OK]
Hint: RAG means check both answer and retrieval quality [OK]
Common Mistakes:
  • Thinking RAG only measures answer quality
  • Confusing retrieval speed with quality
  • Ignoring document relevance in evaluation
2. Which of the following is a common metric used to evaluate the retrieval part of a RAG system?
easy
A. Mean squared error
B. BLEU score
C. Cross-entropy loss
D. Retrieval precision

Solution

  1. Step 1: Identify retrieval metrics

    Retrieval precision measures how many retrieved documents are relevant.
  2. Step 2: Match metric to retrieval

    BLEU is for text generation, cross-entropy and MSE are loss functions, not retrieval metrics.
  3. Final Answer:

    Retrieval precision -> Option D
  4. Quick Check:

    Retrieval metric = precision [OK]
Hint: Precision measures retrieval relevance, not BLEU or loss [OK]
Common Mistakes:
  • Choosing BLEU which is for generation
  • Confusing loss functions with evaluation metrics
  • Ignoring retrieval-specific metrics
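To make the retrieval metric concrete, here is a minimal sketch of precision@k and recall@k for a single query. The function name `precision_recall_at_k` and the document IDs are hypothetical, chosen only for illustration, and the sketch assumes the retrieved list contains no duplicate IDs.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for one query.

    Assumes `retrieved` is ranked best-first and has no duplicate IDs.
    """
    top_k = retrieved[:k]
    relevant = set(relevant)
    hits = len(set(top_k) & relevant)  # relevant documents found in the top k
    precision = hits / len(top_k) if top_k else 0.0    # divide by retrieved count
    recall = hits / len(relevant) if relevant else 0.0  # divide by relevant count
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d2", "d3", "d4", "d9"}
p, r = precision_recall_at_k(retrieved, relevant, k=3)
print(round(p, 2), round(r, 2))  # 0.67 0.5
```

Both metrics share the same numerator; only the denominator differs, which is why the two are easy to mix up.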
3. Consider this Python snippet evaluating a RAG model's answer quality using F1 score:
from sklearn.metrics import f1_score
true_answers = ["cat", "dog", "bird"]
pred_answers = ["cat", "dog", "cat"]
f1 = f1_score(true_answers, pred_answers, average='macro')
print(round(f1, 2))
What will be the output?
medium
A. Error due to string inputs
B. 0.75
C. 0.56
D. 1.00

Solution

  1. Step 1: Verify f1_score handles strings

    sklearn's f1_score supports string labels directly via internal encoding.
  2. Step 2: Compute macro F1

    Classes: 'bird', 'cat', 'dog'
    • 'bird': F1 = 0 (TP=0, predicted 0 times)
    • 'cat': prec=1/2=0.5, rec=1/1=1, F1=2×0.5×1/(0.5+1)=0.67
    • 'dog': F1=1
    Macro F1 = (0 + 0.67 + 1)/3 ≈ 0.5556, and round(0.5556, 2) = 0.56
  3. Final Answer:

    0.56 -> Option C
  4. Quick Check:

    macro F1 = (0 + 0.67 + 1)/3 = 0.56 [OK]
Hint: f1_score works on strings; macro F1=(0+0.67+1)/3=0.56 [OK]
Common Mistakes:
  • Computing micro F1 or accuracy (0.67)
  • Expecting error due to strings
  • Wrong per-class calculation (0.75)
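The per-class arithmetic above can be checked without scikit-learn. The sketch below is a hand-rolled macro F1 (the helper name `macro_f1` is hypothetical, not a library function) that treats a class with no true positives, false positives, or false negatives as scoring 0.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); score 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


true_answers = ["cat", "dog", "bird"]
pred_answers = ["cat", "dog", "cat"]
print(round(macro_f1(true_answers, pred_answers), 2))  # 0.56
```

Here 'bird' contributes 0, 'cat' contributes 2/3, and 'dog' contributes 1, matching the worked solution.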
4. You have this code snippet to compute retrieval precision but it gives wrong results:
retrieved_docs = ["doc1", "doc2", "doc3"]
relevant_docs = ["doc2", "doc4"]
precision = len(set(retrieved_docs) & set(relevant_docs)) / len(relevant_docs)
print(round(precision, 2))
What is the bug, and how do you fix it?
medium
A. Divide by len(retrieved_docs) instead of len(relevant_docs)
B. Use union instead of intersection in numerator
C. Convert lists to tuples before set operations
D. No bug, code is correct

Solution

  1. Step 1: Understand precision formula

    Precision = relevant retrieved / total retrieved, so denominator must be retrieved docs count.
  2. Step 2: Identify denominator mistake

    Code divides by len(relevant_docs), which is recall formula denominator.
  3. Step 3: Fix denominator

    Change denominator to len(retrieved_docs) to compute precision correctly.
  4. Final Answer:

    Divide by len(retrieved_docs) instead of len(relevant_docs) -> Option A
  5. Quick Check:

    Precision denominator = retrieved docs count [OK]
Hint: Precision divides by retrieved docs count, not relevant docs [OK]
Common Mistakes:
  • Mixing precision with recall formula
  • Using union instead of intersection
  • Ignoring set conversion issues
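As a sketch of the fix, the snippet below reuses the same inputs and computes precision and recall side by side, so the effect of the denominator is visible. The variable `hits` is introduced here for illustration.

```python
retrieved_docs = ["doc1", "doc2", "doc3"]
relevant_docs = ["doc2", "doc4"]

# Relevant documents that were actually retrieved (here: only "doc2").
hits = len(set(retrieved_docs) & set(relevant_docs))

precision = hits / len(retrieved_docs)  # fixed: divide by retrieved count
recall = hits / len(relevant_docs)      # what the buggy code computed

print(round(precision, 2), round(recall, 2))  # 0.33 0.5
```

The original snippet printed 0.5, which is the recall for these inputs rather than the precision.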
5. You want to evaluate a RAG model combining answer F1 score and retrieval precision into a single metric. Which approach is best to fairly combine these metrics?
hard
A. Add F1 score and retrieval precision directly
B. Calculate the harmonic mean of F1 score and retrieval precision
C. Use only the higher of the two scores
D. Multiply F1 score by retrieval precision without normalization

Solution

  1. Step 1: Understand metric combination needs

    Combining metrics requires balancing both scores fairly, avoiding dominance by one.
  2. Step 2: Evaluate combination methods

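    The harmonic mean is pulled toward the lower of the two scores, so a system cannot score well by excelling at only one component. A plain sum leaves the 0 to 1 range and lets one strong score mask a weak one, and a raw product shrinks the result even when both scores are good.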
  3. Step 3: Choose harmonic mean

    The harmonic mean is the standard way to combine precision and recall (it is how F1 itself is defined), so it carries over naturally to combining answer F1 and retrieval precision.
  4. Final Answer:

    Calculate the harmonic mean of F1 score and retrieval precision -> Option B
  5. Quick Check:

    Harmonic mean balances combined metrics [OK]
Hint: Use harmonic mean to balance combined metrics fairly [OK]
Common Mistakes:
  • Adding metrics without normalization
  • Ignoring metric scale differences
  • Choosing max score only
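To see why the harmonic mean is the more conservative choice, the sketch below compares it with the arithmetic mean on made-up scores. The values 0.9 and 0.3 and the helper name `harmonic_mean` are assumptions for illustration, not results from any real system.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; 0.0 when both are zero."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


# Made-up scores: strong generation, weak retrieval.
answer_f1 = 0.9
retrieval_precision = 0.3

combined = harmonic_mean(answer_f1, retrieval_precision)
average = (answer_f1 + retrieval_precision) / 2

print(round(combined, 2))  # 0.45
print(round(average, 2))   # 0.6
```

The harmonic mean stays closer to the weaker score, so poor retrieval is not hidden by good generation. Whether that trade-off is right depends on the application; some evaluations report the two scores separately instead.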