
RAG evaluation metrics in Prompt Engineering / GenAI - Full Explanation

Introduction
When building retrieval-augmented generation (RAG) systems, which find information in large collections and use it to generate answers, it is important to know how well they work. RAG evaluation metrics measure how well such a system retrieves the right information and how useful its answers are.
Explanation
Recall
Recall measures how many of the relevant pieces of information the system successfully finds. It focuses on completeness, showing if the system misses important facts. A high recall means the system finds most of what it should.
Recall tells us what share of the relevant information the system finds.
Precision
Precision measures how many of the pieces of information the system finds are actually relevant. It focuses on accuracy, showing if the system avoids returning wrong or unrelated facts. A high precision means most of what the system retrieves is relevant.
Precision tells us how accurate the system’s found information is.
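The recall and precision definitions above can be sketched for a single query as follows. This is a minimal illustration, assuming documents are identified by unique string IDs; the function name `precision_recall` and the sample IDs are hypothetical, not part of any library.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating both inputs as sets of IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant items that were retrieved
    # Guard against empty inputs to avoid division by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


retrieved = ["d1", "d2", "d3", "d4"]  # what the system returned (made-up IDs)
relevant = ["d1", "d3", "d5"]         # what it should have returned (made-up IDs)
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Both metrics share the same numerator (relevant items retrieved) and differ only in the denominator: precision divides by what was retrieved, recall by what was relevant.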
F1 Score
F1 Score combines recall and precision into one number by taking their harmonic mean, which stays low unless both values are high. It gives a single view of how accurate and how complete the system's results are. A high F1 score means the system is both accurate and complete.
F1 Score balances recall and precision to show overall performance.
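As a small sketch of how the balance works, F1 can be computed directly from precision and recall with the standard formula 2PR / (P + R). The helper name `f1_from` is hypothetical, and the input values are made up for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_from(0.5, 1.0), 2))  # 0.67
print(round(f1_from(0.9, 0.1), 2))  # 0.18
```

The second call shows why F1 is not a plain average: a very high precision cannot compensate for a very low recall.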
Exact Match
Exact Match checks if the system’s answer exactly matches the correct answer. It is a strict measure that does not allow any differences. This metric is useful when precise answers are needed.
Exact Match measures if the answer is exactly right without any changes.
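A minimal sketch of Exact Match is a plain string comparison. The second function is an assumption that goes beyond the strict definition above: some evaluation setups normalize case and whitespace before comparing, and it is shown only to make the contrast visible. Both function names are hypothetical.

```python
def exact_match(prediction, reference):
    """Strict Exact Match: 1 only if the two strings are identical."""
    return int(prediction == reference)


def normalized_exact_match(prediction, reference):
    """Assumed relaxed variant: ignore case and extra whitespace before comparing."""
    def normalize(text):
        return " ".join(text.lower().split())
    return int(normalize(prediction) == normalize(reference))


print(exact_match("Paris", "Paris"))                # 1
print(exact_match("paris", "Paris"))                # 0
print(normalized_exact_match(" paris ", "Paris"))   # 1
```

Averaging the 0/1 results over a test set gives the Exact Match rate for that set.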
ROUGE and BLEU Scores
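ROUGE and BLEU are metrics that compare the system’s generated text to reference answers by counting overlapping words or phrases (n-grams). ROUGE is recall-oriented (how much of the reference appears in the generated text), while BLEU is precision-oriented (how much of the generated text appears in the reference). They measure how similar the generated answer is to the expected one, which makes them useful for evaluating text quality.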
ROUGE and BLEU measure how closely generated text matches reference answers.
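The overlap idea behind both metrics can be illustrated with single words only. The sketch below is deliberately simplified and is not a full implementation of either metric: it uses whitespace tokenization, unigrams only, and omits BLEU's brevity penalty and higher-order n-grams. The function name `unigram_overlap` is hypothetical.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (BLEU-style unigram precision, ROUGE-1-style recall)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0  # share of candidate words found in reference
    recall = overlap / sum(ref.values()) if ref else 0.0       # share of reference words found in candidate
    return precision, recall


p, r = unigram_overlap("the cat sat", "the cat sat on the mat")
print(f"{p:.2f} {r:.2f}")  # 1.00 0.50
```

Here every generated word appears in the reference (precision 1.00), but the short answer covers only half of the reference words (recall 0.50), which is the kind of gap the two metrics are designed to expose.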
Real World Analogy

Imagine you are looking for specific books in a large library. Recall is like how many of the books you wanted you actually find on the shelves. Precision is like how many of the books you picked up are actually the ones you wanted, not random or wrong books. F1 Score is like a score that tells you how good you are at both finding and picking the right books.

Recall → Finding most of the books you wanted in the library
Precision → Picking only the books you wanted without mistakes
F1 Score → Overall score of how well you found and picked the right books
Exact Match → Picking a book that exactly matches the title you wanted
ROUGE and BLEU Scores → Comparing your book summary to the official summary to see how similar they are
Diagram
┌─────────────┐       ┌─────────────┐       ┌─────────────┐
│  Recall     │──────▶│  F1 Score   │◀──────│  Precision  │
└─────────────┘       └─────────────┘       └─────────────┘
       │                                         │
       ▼                                         ▼
┌─────────────┐                           ┌─────────────┐
│Exact Match  │                           │ROUGE & BLEU │
└─────────────┘                           └─────────────┘
Diagram showing how Recall and Precision combine into F1 Score, with Exact Match and ROUGE/BLEU as additional evaluation metrics.
Key Facts
Recall: Measures the proportion of relevant information found by the system.
Precision: Measures the proportion of found information that is relevant.
F1 Score: Harmonic mean of recall and precision showing overall accuracy and completeness.
Exact Match: Checks if the system’s answer exactly matches the correct answer.
ROUGE Score: Evaluates overlap of words and phrases between generated and reference texts.
BLEU Score: Measures similarity of generated text to reference text based on matching n-grams.
Common Confusions
Thinking high recall means the system is always good. High recall alone can mean the system finds many relevant items but may also include many irrelevant ones, so precision must also be considered.
Believing precision and recall measure the same thing. Precision measures accuracy of found items, while recall measures completeness; they focus on different aspects of performance.
Assuming Exact Match allows partial credit. Exact Match requires the answer to be completely correct with no differences; partial matches do not count.
Summary
Recall and precision are key metrics that measure completeness and accuracy of information retrieval in RAG systems.
F1 Score balances recall and precision to give an overall performance measure.
Exact Match and ROUGE/BLEU scores help evaluate the quality and correctness of generated answers.

Practice

(1/5)
1. What do RAG evaluation metrics primarily measure in a retrieval-augmented generation system?
easy
A. Both the quality of generated answers and the relevance of retrieved documents
B. Only the speed of document retrieval
C. The size of the training dataset
D. The number of layers in the neural network

Solution

  1. Step 1: Understand RAG system components

    RAG combines document retrieval and answer generation, so evaluation must cover both parts.
  2. Step 2: Identify what metrics measure

    Metrics check answer quality (like accuracy) and retrieval quality (like precision).
  3. Final Answer:

    Both the quality of generated answers and the relevance of retrieved documents -> Option A
  4. Quick Check:

    RAG metrics = answer + retrieval quality [OK]
Hint: RAG means check both answer and retrieval quality [OK]
Common Mistakes:
  • Thinking RAG only measures answer quality
  • Confusing retrieval speed with quality
  • Ignoring document relevance in evaluation
2. Which of the following is a common metric used to evaluate the retrieval part of a RAG system?
easy
A. Mean squared error
B. BLEU score
C. Cross-entropy loss
D. Retrieval precision

Solution

  1. Step 1: Identify retrieval metrics

    Retrieval precision measures how many retrieved documents are relevant.
  2. Step 2: Match metric to retrieval

    BLEU is for text generation, cross-entropy and MSE are loss functions, not retrieval metrics.
  3. Final Answer:

    Retrieval precision -> Option D
  4. Quick Check:

    Retrieval metric = precision [OK]
Hint: Precision measures retrieval relevance, not BLEU or loss [OK]
Common Mistakes:
  • Choosing BLEU which is for generation
  • Confusing loss functions with evaluation metrics
  • Ignoring retrieval-specific metrics
3. Consider this Python snippet evaluating a RAG model's answer quality using F1 score:
from sklearn.metrics import f1_score
true_answers = ["cat", "dog", "bird"]
pred_answers = ["cat", "dog", "cat"]
f1 = f1_score(true_answers, pred_answers, average='macro')
print(round(f1, 2))
What will be the output?
medium
A. Error due to string inputs
B. 0.75
C. 0.56
D. 1.00

Solution

  1. Step 1: Verify f1_score handles strings

    sklearn's f1_score supports string labels directly via internal encoding.
  2. Step 2: Compute macro F1

    Classes: 'bird', 'cat', 'dog'
    • 'bird': F1 = 0 (TP=0, predicted 0 times)
    • 'cat': prec=1/2=0.5, rec=1/1=1, F1=2×0.5×1/(0.5+1)=0.67
    • 'dog': F1=1
    Macro F1 = (0 + 0.67 + 1)/3 ≈ 0.5556, round(0.5556, 2) = 0.56
  3. Final Answer:

    0.56 -> Option C
  4. Quick Check:

    macro F1 = (0 + 0.67 + 1)/3 = 0.56 [OK]
Hint: f1_score works on strings; macro F1=(0+0.67+1)/3=0.56 [OK]
Common Mistakes:
  • Computing micro F1 or accuracy (0.67)
  • Expecting error due to strings
  • Wrong per-class calculation (0.75)
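The per-class arithmetic in the solution above can be checked without scikit-learn. This is a sketch that reimplements macro F1 by hand under the usual one-vs-rest counting; the helper name `per_class_f1` is hypothetical and it is not the library's internal code.

```python
true_answers = ["cat", "dog", "bird"]
pred_answers = ["cat", "dog", "cat"]


def per_class_f1(label, y_true, y_pred):
    """F1 for one label from true positives, false positives, false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


labels = sorted(set(true_answers) | set(pred_answers))
scores = {label: per_class_f1(label, true_answers, pred_answers) for label in labels}
macro_f1 = sum(scores.values()) / len(scores)  # unweighted mean over classes

print({label: round(score, 2) for label, score in scores.items()})
# {'bird': 0.0, 'cat': 0.67, 'dog': 1.0}
print(round(macro_f1, 2))  # 0.56
```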
4. You have this code snippet to compute retrieval precision but it gives wrong results:
retrieved_docs = ["doc1", "doc2", "doc3"]
relevant_docs = ["doc2", "doc4"]
precision = len(set(retrieved_docs) & set(relevant_docs)) / len(relevant_docs)
print(round(precision, 2))
What is the bug, and how do you fix it?
medium
A. Divide by len(retrieved_docs) instead of len(relevant_docs)
B. Use union instead of intersection in numerator
C. Convert lists to tuples before set operations
D. No bug, code is correct

Solution

  1. Step 1: Understand precision formula

    Precision = relevant retrieved / total retrieved, so denominator must be retrieved docs count.
  2. Step 2: Identify denominator mistake

    Code divides by len(relevant_docs), which is recall formula denominator.
  3. Step 3: Fix denominator

    Change denominator to len(retrieved_docs) to compute precision correctly.
  4. Final Answer:

    Divide by len(retrieved_docs) instead of len(relevant_docs) -> Option A
  5. Quick Check:

    Precision denominator = retrieved docs count [OK]
Hint: Precision divides by retrieved docs count, not relevant docs [OK]
Common Mistakes:
  • Mixing precision with recall formula
  • Using union instead of intersection
  • Ignoring set conversion issues
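For reference, the corrected snippet for this question might look like the sketch below. It keeps the question's data and variable names, adds a `hits` variable and a recall line for comparison, and assumes the retrieved list contains no duplicate IDs.

```python
retrieved_docs = ["doc1", "doc2", "doc3"]
relevant_docs = ["doc2", "doc4"]

# Shared numerator: documents that are both retrieved and relevant.
hits = len(set(retrieved_docs) & set(relevant_docs))

precision = hits / len(retrieved_docs)  # fixed: divide by the number retrieved
recall = hits / len(relevant_docs)      # what the buggy line actually computed

print(round(precision, 2))  # 0.33
print(round(recall, 2))     # 0.5
```

The buggy version printed 0.5, which is the recall for this example rather than the precision.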
5. You want to evaluate a RAG model combining answer F1 score and retrieval precision into a single metric. Which approach is best to fairly combine these metrics?
hard
A. Add F1 score and retrieval precision directly
B. Calculate the harmonic mean of F1 score and retrieval precision
C. Use only the higher of the two scores
D. Multiply F1 score by retrieval precision without normalization

Solution

  1. Step 1: Understand metric combination needs

    Combining metrics requires balancing both scores fairly, avoiding dominance by one.
  2. Step 2: Evaluate combination methods

    Harmonic mean balances low and high values well; addition or multiplication can skew results.
  3. Step 3: Choose harmonic mean

    Harmonic mean is common for combining precision and recall, so it suits combining F1 and retrieval precision.
  4. Final Answer:

    Calculate the harmonic mean of F1 score and retrieval precision -> Option B
  5. Quick Check:

    Harmonic mean balances combined metrics [OK]
Hint: Use harmonic mean to balance combined metrics fairly [OK]
Common Mistakes:
  • Adding metrics without normalization
  • Ignoring metric scale differences
  • Choosing max score only
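To see why the harmonic mean is the balanced choice in this question, the sketch below compares it with a plain average using Python's standard `statistics` module. The two scores are made-up values, and combining them this way is one possible design rather than an established standard metric.

```python
from statistics import harmonic_mean, mean

answer_f1 = 0.9            # hypothetical answer-quality score
retrieval_precision = 0.2  # hypothetical retrieval score

arithmetic = mean([answer_f1, retrieval_precision])
harmonic = harmonic_mean([answer_f1, retrieval_precision])

print(round(arithmetic, 2))  # 0.55
print(round(harmonic, 2))    # 0.33
```

The plain average hides the weak retrieval score, while the harmonic mean is pulled toward the lower value, so a system must do well on both parts to score well overall.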