Practice

1. What do RAG evaluation metrics primarily measure in a retrieval-augmented generation system?

easy

A. Both the quality of generated answers and the relevance of retrieved documents

B. Only the speed of document retrieval

C. The size of the training dataset

D. The number of layers in the neural network

Solution

Step 1: Understand RAG system components
RAG combines document retrieval and answer generation, so evaluation must cover both parts.
Step 2: Identify what metrics measure
Metrics check answer quality (like accuracy) and retrieval quality (like precision).
Final Answer:
Both the quality of generated answers and the relevance of retrieved documents -> Option A
Quick Check:
RAG metrics = answer + retrieval quality [OK]

Hint: RAG evaluation checks both answer quality and retrieval quality

Common Mistakes:

Thinking RAG only measures answer quality
Confusing retrieval speed with quality
Ignoring document relevance in evaluation
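
A minimal sketch of scoring both sides of a RAG system for a single query. The helper names (retrieval_precision, exact_match, evaluate_rag_example) and the sample data are hypothetical, chosen only for illustration; exact match stands in for answer quality here, and real pipelines often use softer answer metrics.

```python
def retrieval_precision(retrieved, relevant):
    """Fraction of the (distinct) retrieved documents that are relevant."""
    retrieved_set = set(retrieved)
    if not retrieved_set:
        return 0.0
    return len(retrieved_set & set(relevant)) / len(retrieved_set)


def exact_match(prediction, reference):
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate_rag_example(example):
    """Report retrieval quality and answer quality side by side."""
    return {
        "retrieval_precision": retrieval_precision(
            example["retrieved"], example["relevant"]
        ),
        "answer_exact_match": exact_match(
            example["prediction"], example["reference"]
        ),
    }


# Hypothetical example: one of two retrieved documents is relevant,
# and the generated answer matches the reference.
example = {
    "retrieved": ["doc1", "doc2"],
    "relevant": ["doc2"],
    "prediction": "Paris",
    "reference": "paris",
}
scores = evaluate_rag_example(example)
print(scores)
```

Keeping the two scores separate makes it visible when a correct answer was produced despite weak retrieval, or the reverse.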

2. Which of the following is a common metric used to evaluate the retrieval part of a RAG system?

easy

A. Mean squared error

B. BLEU score

C. Cross-entropy loss

D. Retrieval precision

Solution

Step 1: Identify retrieval metrics
Retrieval precision measures the fraction of retrieved documents that are relevant.
Step 2: Match metric to retrieval
BLEU evaluates generated text, while cross-entropy and mean squared error are loss functions, not retrieval metrics.
Final Answer:
Retrieval precision -> Option D
Quick Check:
Retrieval metric = precision [OK]

Hint: Precision measures retrieval relevance; BLEU and loss functions do not

Common Mistakes:

Choosing BLEU which is for generation
Confusing loss functions with evaluation metrics
Ignoring retrieval-specific metrics
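
Retrieval precision is often reported over the top k ranked results. The sketch below assumes that ranked variant (precision@k), which the question itself does not mention; the function name and the document IDs are hypothetical.

```python
def precision_at_k(ranked_docs, relevant_docs, k):
    """Fraction of the top-k ranked documents that are relevant.

    Assumption: the denominator is the number of documents actually
    returned in the top k, so a short result list is not penalized.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_docs[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_docs)
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / len(top_k)


# Hypothetical ranking for one query, best match first.
ranked = ["d3", "d1", "d7", "d2"]
relevant = ["d1", "d2"]

for k in (1, 2, 4):
    print(k, precision_at_k(ranked, relevant, k))
```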

3. Consider this Python snippet evaluating a RAG model's answer quality using F1 score:

from sklearn.metrics import f1_score
true_answers = ["cat", "dog", "bird"]
pred_answers = ["cat", "dog", "cat"]
f1 = f1_score(true_answers, pred_answers, average='macro')
print(round(f1, 2))

What will be the output?

medium

A. Error due to string inputs

B. 0.75

C. 0.56

D. 1.00

Solution

Step 1: Verify f1_score handles strings
sklearn's f1_score supports string labels directly via internal encoding.
Step 2: Compute macro F1
Classes: 'bird', 'cat', 'dog'
• 'bird': F1 = 0 (TP=0, predicted 0 times)
• 'cat': prec=1/2=0.5, rec=1/1=1, F1=2×0.5×1/(0.5+1)=0.67
• 'dog': F1=1
Macro F1 = (0 + 0.67 + 1)/3 ≈ 0.5556, and round(0.5556, 2) = 0.56
Final Answer:
0.56 -> Option C
Quick Check:
macro F1 = (0 + 0.67 + 1)/3 = 0.56 [OK]

Hint: f1_score accepts string labels; macro F1 = (0 + 0.67 + 1)/3 ≈ 0.56

Common Mistakes:

Computing micro F1 or accuracy (0.67)
Expecting error due to strings
Wrong per-class calculation (0.75)
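
The per-class arithmetic in the solution can be reproduced without sklearn. This is a plain-Python sketch of macro F1 for illustration, not sklearn's implementation; the function name macro_f1 is hypothetical, and a class with no true positives, false positives, or false negatives is assumed to score 0.

```python
def macro_f1(y_true, y_pred):
    """Return (macro F1, per-class F1) over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN), the same value as 2*P*R / (P + R)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    return sum(per_class.values()) / len(per_class), per_class


true_answers = ["cat", "dog", "bird"]
pred_answers = ["cat", "dog", "cat"]

macro, per_class = macro_f1(true_answers, pred_answers)
print(per_class)        # 'bird' scores 0.0, 'dog' scores 1.0
print(round(macro, 2))  # 0.56
```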

4. You have this code snippet to compute retrieval precision but it gives wrong results:

retrieved_docs = ["doc1", "doc2", "doc3"]
relevant_docs = ["doc2", "doc4"]
precision = len(set(retrieved_docs) & set(relevant_docs)) / len(relevant_docs)
print(round(precision, 2))

What is the bug, and how do you fix it?

medium

A. Divide by len(retrieved_docs) instead of len(relevant_docs)

B. Use union instead of intersection in numerator

C. Convert lists to tuples before set operations

D. No bug, code is correct

Solution

Step 1: Understand precision formula
Precision = relevant retrieved / total retrieved, so the denominator must be the number of retrieved documents.
Step 2: Identify denominator mistake
The code divides by len(relevant_docs), which is the denominator of the recall formula.
Step 3: Fix denominator
Change the denominator to len(retrieved_docs) to compute precision correctly.
Final Answer:
Divide by len(retrieved_docs) instead of len(relevant_docs) -> Option A
Quick Check:
Precision denominator = retrieved docs count [OK]

Hint: Precision divides by the number of retrieved documents, not the number of relevant ones

Common Mistakes:

Mixing precision with recall formula
Using union instead of intersection
Ignoring set conversion issues
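
For reference, a sketch of the snippet with the fix from Option A applied, with recall computed alongside to show that the original denominator gave recall rather than precision. It assumes the retrieved list contains no duplicates.

```python
retrieved_docs = ["doc1", "doc2", "doc3"]
relevant_docs = ["doc2", "doc4"]

# Documents that are both retrieved and relevant.
hits = set(retrieved_docs) & set(relevant_docs)

# Precision: share of retrieved documents that are relevant.
precision = len(hits) / len(retrieved_docs)

# Recall: share of relevant documents that were retrieved.
# This is what the original, buggy snippet actually computed.
recall = len(hits) / len(relevant_docs)

print(round(precision, 2))  # 0.33
print(round(recall, 2))     # 0.5
```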

5. You want to evaluate a RAG model by combining the answer F1 score and retrieval precision into a single metric. Which approach combines these metrics most fairly?

hard

A. Add F1 score and retrieval precision directly

B. Calculate the harmonic mean of F1 score and retrieval precision

C. Use only the higher of the two scores

D. Multiply F1 score by retrieval precision without normalization

Solution

Step 1: Understand metric combination needs
Combining metrics requires balancing both scores fairly, avoiding dominance by one.
Step 2: Evaluate combination methods
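The harmonic mean stays low unless both scores are high, so a strong score cannot mask a weak one. A plain sum leaves the [0, 1] range and lets one score compensate for the other, and a raw product understates performance even when both scores are good (0.8 × 0.8 = 0.64).
Step 3: Choose harmonic mean
The harmonic mean is the standard way to combine precision and recall (it is how F1 itself is defined), so it also suits combining answer F1 with retrieval precision.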
Final Answer:
Calculate the harmonic mean of F1 score and retrieval precision -> Option B
Quick Check:
Harmonic mean balances combined metrics [OK]

Hint: Use the harmonic mean to balance the combined metrics fairly

Common Mistakes:

Adding metrics without normalization
Ignoring metric scale differences
Choosing max score only
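
A short sketch of the harmonic-mean combination, compared with the arithmetic mean. The function name and the two scores are hypothetical, and returning 0 when both inputs are 0 is an assumption made to avoid dividing by zero.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two scores in [0, 1]; 0.0 if both are zero."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


# Hypothetical scores: strong answers, weak retrieval.
answer_f1 = 0.9
retrieval_precision = 0.3

combined = harmonic_mean(answer_f1, retrieval_precision)
arithmetic = (answer_f1 + retrieval_precision) / 2

print(round(combined, 2))    # 0.45
print(round(arithmetic, 2))  # 0.6
```

The harmonic mean lands closer to the weaker score, which is the balancing behavior the solution relies on. Python's standard library also provides statistics.harmonic_mean for the general case.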
