RAG evaluation metrics in Prompt Engineering / GenAI - Cheat Sheet & Quick Revision

Recall & Review
beginner
What does RAG stand for in machine learning?
RAG stands for Retrieval-Augmented Generation, a method combining retrieval of documents with text generation.
beginner
Why do we need evaluation metrics for RAG models?
Evaluation metrics help us measure how well the RAG model retrieves relevant information and generates accurate, useful answers.
intermediate
Name two common metrics used to evaluate the retrieval part of RAG.
Recall@k and Precision@k are common retrieval metrics. Recall@k is the fraction of all relevant documents that appear in the top k results; Precision@k is the fraction of the top k results that are relevant.
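
As a rough illustration of these two retrieval metrics, here is a minimal sketch. The function names and the sample document IDs are hypothetical, document IDs are assumed to be unique, and Precision@k is divided by the number of results actually returned (some implementations divide by k instead).

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for doc in top_k if doc in relevant_set)
    return hits / len(top_k)


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant_set)
    return hits / len(relevant_set)


# Hypothetical ranked retrieval output and gold labels.
ranked = ["doc3", "doc1", "doc7", "doc2"]
gold = ["doc1", "doc2"]

print(round(precision_at_k(ranked, gold, k=3), 2))  # 1 of 3 retrieved is relevant -> 0.33
print(round(recall_at_k(ranked, gold, k=3), 2))     # 1 of 2 relevant is found -> 0.5
```

Raising k to 4 in this example brings recall to 1.0, which shows why recall is usually reported together with the k it was measured at.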
intermediate
Which metrics measure the quality of generated text in RAG?
BLEU, ROUGE, and METEOR are popular metrics that score generated text by how closely its words and phrases overlap with reference answers.
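
To make the idea of overlap with a reference concrete, here is a simplified unigram-overlap sketch in the spirit of ROUGE-1. It is not the official BLEU, ROUGE, or METEOR implementation (those add things such as longer n-grams, length penalties, or stemming), and the function name and sample sentences are hypothetical.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (precision, recall, f1) over lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = unigram_overlap("the cat sat on the mat", "the cat is on the mat")
print(round(p, 2), round(r, 2), round(f, 2))  # 5 of 6 tokens overlap -> 0.83 0.83 0.83
```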
beginner
How does the Exact Match (EM) metric work in RAG evaluation?
Exact Match checks whether the generated answer exactly matches the correct answer, often after light normalization such as lowercasing, giving a simple yes/no score.
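
A minimal sketch of Exact Match follows. The normalization step (lowercasing, stripping punctuation, collapsing whitespace) is an assumption about what counts as "the same answer"; evaluation suites differ on this, and the helper names and sample pairs are hypothetical.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))


# Hypothetical (prediction, reference) pairs.
pairs = [
    ("Paris", "paris"),
    ("Paris, France", "Paris"),
    (" the Eiffel Tower. ", "The Eiffel Tower"),
]
scores = [exact_match(pred, ref) for pred, ref in pairs]
print(scores)                                # [1, 0, 1]
print(round(sum(scores) / len(scores), 2))   # average EM over the set -> 0.67
```

The second pair shows the main limitation: a correct but more detailed answer still scores 0, so EM suits short factual answers better than free-form text.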
Which metric checks if the correct document is among the top retrieved results in RAG?
A. Exact Match
B. BLEU
C. Recall@k
D. METEOR
What does BLEU score evaluate in RAG models?
A. Generated text quality
B. Retrieval accuracy
C. Training speed
D. Model size
Which metric gives a simple yes/no score if the generated answer matches exactly the correct answer?
A. ROUGE
B. Exact Match
C. Recall@k
D. Precision@k
Precision@k in RAG evaluation measures:
A. Training loss
B. How many relevant documents are missed
C. Quality of generated text
D. How many retrieved documents are relevant within top k
Which metric is NOT typically used to evaluate the generation part of RAG?
A. Recall@k
B. ROUGE
C. METEOR
D. BLEU
Explain the difference between retrieval and generation evaluation metrics in RAG.
Think about the two main parts of RAG: finding info and writing answers.
Describe how the Exact Match metric works and when it is useful in RAG evaluation.
Consider simple yes/no correctness.

      Practice

      1. What do RAG evaluation metrics primarily measure in a retrieval-augmented generation system?
      easy
      A. Both the quality of generated answers and the relevance of retrieved documents
      B. Only the speed of document retrieval
      C. The size of the training dataset
      D. The number of layers in the neural network

      Solution

      1. Step 1: Understand RAG system components

        RAG combines document retrieval and answer generation, so evaluation must cover both parts.
      2. Step 2: Identify what metrics measure

        Metrics check answer quality (like accuracy) and retrieval quality (like precision).
      3. Final Answer:

        Both the quality of generated answers and the relevance of retrieved documents -> Option A
      4. Quick Check:

        RAG metrics = answer + retrieval quality [OK]
      Hint: RAG means check both answer and retrieval quality [OK]
      Common Mistakes:
      • Thinking RAG only measures answer quality
      • Confusing retrieval speed with quality
      • Ignoring document relevance in evaluation
      2. Which of the following is a common metric used to evaluate the retrieval part of a RAG system?
      easy
      A. Mean squared error
      B. BLEU score
      C. Cross-entropy loss
      D. Retrieval precision

      Solution

      1. Step 1: Identify retrieval metrics

        Retrieval precision measures how many retrieved documents are relevant.
      2. Step 2: Match metric to retrieval

        BLEU is for text generation; cross-entropy and MSE are loss functions, not retrieval metrics.
      3. Final Answer:

        Retrieval precision -> Option D
      4. Quick Check:

        Retrieval metric = precision [OK]
      Hint: Precision measures retrieval relevance, not BLEU or loss [OK]
      Common Mistakes:
      • Choosing BLEU which is for generation
      • Confusing loss functions with evaluation metrics
      • Ignoring retrieval-specific metrics
      3. Consider this Python snippet evaluating a RAG model's answer quality using F1 score:
      from sklearn.metrics import f1_score
      true_answers = ["cat", "dog", "bird"]
      pred_answers = ["cat", "dog", "cat"]
      f1 = f1_score(true_answers, pred_answers, average='macro')
      print(round(f1, 2))
      What will be the output?
      medium
      A. Error due to string inputs
      B. 0.75
      C. 0.56
      D. 1.00

      Solution

      1. Step 1: Verify f1_score handles strings

        sklearn's f1_score supports string labels directly via internal encoding.
      2. Step 2: Compute macro F1

        Classes: 'bird', 'cat', 'dog'
        • 'bird': F1 = 0 (TP=0, predicted 0 times)
        • 'cat': prec=1/2=0.5, rec=1/1=1, F1=2×0.5×1/(0.5+1)=0.67
        • 'dog': F1=1
        Macro F1 = (0 + 0.67 + 1)/3 ≈ 0.5556, and round(0.5556, 2) = 0.56
      3. Final Answer:

        0.56 -> Option C
      4. Quick Check:

        macro F1 = (0 + 0.67 + 1)/3 = 0.56 [OK]
      Hint: f1_score works on strings; macro F1=(0+0.67+1)/3=0.56 [OK]
      Common Mistakes:
      • Computing micro F1 or accuracy (0.67)
      • Expecting error due to strings
      • Wrong per-class calculation (0.75)
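
The per-class arithmetic above can be checked without scikit-learn. The sketch below is a hand-rolled macro F1, not sklearn's implementation; the function name is hypothetical, and it uses the identity F1 = 2·TP / (2·TP + FP + FN), which equals the harmonic mean of precision and recall.

```python
def macro_f1(y_true, y_pred):
    """Return (macro F1, per-class F1 dict) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true positives scores 0.
        per_class[label] = 2 * tp / denom if denom else 0.0
    return sum(per_class.values()) / len(per_class), per_class


true_answers = ["cat", "dog", "bird"]
pred_answers = ["cat", "dog", "cat"]
score, by_class = macro_f1(true_answers, pred_answers)
print({label: round(value, 2) for label, value in by_class.items()})
# {'bird': 0.0, 'cat': 0.67, 'dog': 1.0}
print(round(score, 2))  # 0.56
```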
      4. You have this code snippet to compute retrieval precision but it gives wrong results:
      retrieved_docs = ["doc1", "doc2", "doc3"]
      relevant_docs = ["doc2", "doc4"]
      precision = len(set(retrieved_docs) & set(relevant_docs)) / len(relevant_docs)
      print(round(precision, 2))
      What is the bug, and how do you fix it?
      medium
      A. Divide by len(retrieved_docs) instead of len(relevant_docs)
      B. Use union instead of intersection in numerator
      C. Convert lists to tuples before set operations
      D. No bug, code is correct

      Solution

      1. Step 1: Understand precision formula

        Precision = relevant retrieved / total retrieved, so denominator must be retrieved docs count.
      2. Step 2: Identify denominator mistake

        Code divides by len(relevant_docs), which is recall formula denominator.
      3. Step 3: Fix denominator

        Change denominator to len(retrieved_docs) to compute precision correctly.
      4. Final Answer:

        Divide by len(retrieved_docs) instead of len(relevant_docs) -> Option A
      5. Quick Check:

        Precision denominator = retrieved docs count [OK]
      Hint: Precision divides by retrieved docs count, not relevant docs [OK]
      Common Mistakes:
      • Mixing precision with recall formula
      • Using union instead of intersection
      • Ignoring set conversion issues
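
For reference, a corrected version of the snippet might look like the sketch below, with recall computed next to precision so the two denominators can be compared. It assumes the retrieved list contains no duplicate IDs; the guards against empty lists are an addition, not part of the original snippet.

```python
retrieved_docs = ["doc1", "doc2", "doc3"]
relevant_docs = ["doc2", "doc4"]

# Documents that are both retrieved and relevant.
hits = set(retrieved_docs) & set(relevant_docs)

# Precision divides by what was retrieved; recall divides by what is relevant.
precision = len(hits) / len(retrieved_docs) if retrieved_docs else 0.0
recall = len(hits) / len(relevant_docs) if relevant_docs else 0.0

print(round(precision, 2))  # 1 hit / 3 retrieved -> 0.33
print(round(recall, 2))     # 1 hit / 2 relevant -> 0.5
```

The original code printed 0.5 because it was computing recall under the name precision.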
      5. You want to evaluate a RAG model by combining its answer F1 score and retrieval precision into a single metric. Which approach is best to fairly combine these metrics?
      hard
      A. Add F1 score and retrieval precision directly
      B. Calculate the harmonic mean of F1 score and retrieval precision
      C. Use only the higher of the two scores
      D. Multiply F1 score by retrieval precision without normalization

      Solution

      1. Step 1: Understand metric combination needs

        Combining metrics requires balancing both scores fairly, avoiding dominance by one.
      2. Step 2: Evaluate combination methods

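        The harmonic mean is pulled toward the lower of the two scores, so one strong score cannot hide a weak one; a plain sum lets a high score mask a low one, and an unnormalized product is harder to interpret.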
      3. Step 3: Choose harmonic mean

        Harmonic mean is common for combining precision and recall, so it suits combining F1 and retrieval precision.
      4. Final Answer:

        Calculate the harmonic mean of F1 score and retrieval precision -> Option B
      5. Quick Check:

        Harmonic mean balances combined metrics [OK]
      Hint: Use harmonic mean to balance combined metrics fairly [OK]
      Common Mistakes:
      • Adding metrics without normalization
      • Ignoring metric scale differences
      • Choosing max score only
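
A small sketch of the harmonic-mean combination, compared with a plain average. The two input scores are made-up illustrative values, and the function name is hypothetical.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two scores in [0, 1]; 0 if either score is 0."""
    if a <= 0 or b <= 0:
        return 0.0
    return 2 * a * b / (a + b)


# Illustrative values only: strong answers, weak retrieval.
answer_f1 = 0.9
retrieval_precision = 0.3

combined = harmonic_mean(answer_f1, retrieval_precision)
average = (answer_f1 + retrieval_precision) / 2

print(round(combined, 2))  # 0.45, pulled toward the weaker score
print(round(average, 2))   # 0.6, hides the weak retrieval more
```

Whether the two components should be weighted equally is a design choice; a weighted harmonic mean is one option if retrieval or answer quality matters more for a given application.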