Evaluating generated text (BLEU, ROUGE) in NLP - Cheat Sheet & Quick Revision

Recall & Review
beginner
What does BLEU score measure in text generation?
BLEU (Bilingual Evaluation Understudy) measures how closely the generated text matches one or more reference texts by comparing overlapping n-grams.
beginner
What is ROUGE used for in evaluating generated text?
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the overlap of units such as n-grams, word sequences, and word pairs between the generated text and reference summaries, focusing on recall.
intermediate
Why is BLEU considered precision-oriented while ROUGE is recall-oriented?
BLEU focuses on how much of the generated text matches the reference (precision), while ROUGE focuses on how much of the reference text is covered by the generated text (recall).
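The precision/recall distinction can be made concrete with a small sketch. It uses plain unigram overlap rather than either full metric, and the function name `overlap_precision_recall` is hypothetical, chosen only for illustration. Both token lists are assumed to be non-empty.

```python
from collections import Counter

def overlap_precision_recall(candidate, reference):
    """Clipped unigram overlap, reported as (precision, recall)."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / len(candidate)  # share of generated words found in the reference
    recall = overlap / len(reference)     # share of reference words found in the generated text
    return precision, recall

candidate = "the cat sat".split()
reference = "the cat sat on the mat".split()
precision, recall = overlap_precision_recall(candidate, reference)
print(precision, recall)
```

A short candidate that only uses reference words scores perfect precision but low recall, which is why BLEU adds a brevity penalty and why ROUGE's recall focus suits summarization.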
beginner
What is an n-gram in the context of BLEU and ROUGE?
An n-gram is a sequence of 'n' words in a row. For example, a 2-gram (bigram) is two consecutive words. Both BLEU and ROUGE compare these sequences between generated and reference texts.
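As a minimal sketch of that definition, the helper below (the name `ngrams` is an illustrative choice, not a library function) slides a window of n tokens over a whitespace-tokenized sentence.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
```

A sentence of six tokens yields five bigrams, and asking for an n larger than the sentence length yields an empty list.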
intermediate
How does BLEU handle multiple reference texts?
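With multiple references, BLEU clips each candidate n-gram's count at the highest count that n-gram has in any single reference, and it uses the reference length closest to the candidate's length for the brevity penalty. This gives credit for any valid phrasing that appears in at least one reference.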
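The multi-reference behaviour can be sketched in a few lines. This is a simplified illustration of clipped (modified) n-gram precision, not the code of any particular library; the names `ngram_counts` and `modified_precision` are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(references, candidate, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand = ngram_counts(candidate, n)
    max_ref = Counter()
    for ref in references:
        # Remember the highest count each n-gram reaches in any single reference.
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as its best reference count.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
candidate = "the the the cat".split()
p1 = modified_precision(references, candidate, 1)
print(p1)
```

Here "the" appears three times in the candidate but at most twice in any one reference, so only two of the three occurrences are credited.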
What does a high BLEU score indicate?
A. The generated text is very different from the reference
B. The generated text closely matches the reference text
C. The generated text is longer than the reference
D. The generated text has many spelling errors
Which metric is more focused on recall in text evaluation?
A. BLEU
B. F1 Score
C. Accuracy
D. ROUGE
What is an n-gram?
A. A sequence of n words in a row
B. A single word
C. A type of neural network
D. A punctuation mark
Which of these is true about BLEU?
A. It measures recall of generated text
B. It only works with one reference text
C. It measures precision of generated text
D. It ignores word order
ROUGE is commonly used to evaluate which type of generated text?
A. Text summarization
B. Speech recognition
C. Machine translation
D. Image captioning
Explain how BLEU and ROUGE differ in evaluating generated text.
Think about what each metric focuses on: matching generated text vs. covering reference text.
Describe what an n-gram is and why it is important for BLEU and ROUGE.
Consider how small word groups help check if texts are similar.

      Practice

      1. What is the main purpose of BLEU and ROUGE scores in evaluating generated text?
      easy
      A. To measure how similar the generated text is to human-written text
      B. To check the spelling errors in generated text
      C. To count the number of words in the generated text
      D. To translate text from one language to another

      Solution

      1. Step 1: Understand the role of BLEU and ROUGE

        Both BLEU and ROUGE compare generated text with human-written reference text and score how much the two overlap.
      2. Step 2: Identify the main purpose

        They do not check spelling, count words, or translate text but measure similarity to human text.
      3. Final Answer:

        To measure how similar the generated text is to human-written text -> Option A
      4. Quick Check:

        BLEU and ROUGE measure similarity [OK]
      Hint: Remember: BLEU and ROUGE check similarity, not spelling or translation [OK]
      Common Mistakes:
      • Confusing BLEU/ROUGE with spell check
      • Thinking they count words only
      • Assuming they translate text
      2. Which of the following is the correct way to calculate BLEU score using Python's nltk library?
      easy
      A. bleu_score = nltk.bleu_score([candidate], reference)
      B. bleu_score = nltk.translate.bleu_score.sentence_bleu([reference], candidate)
      C. bleu_score = nltk.translate.bleu_score([candidate], [reference])
      D. bleu_score = nltk.score.bleu(candidate, reference)

      Solution

      1. Step 1: Recall the nltk BLEU function syntax

        The correct function is sentence_bleu from nltk.translate.bleu_score, which takes a list of references and a candidate sentence.
      2. Step 2: Match the correct syntax

        Option B calls sentence_bleu([reference], candidate): the references come first, wrapped in a list, followed by the tokenized candidate.
      3. Final Answer:

        bleu_score = nltk.translate.bleu_score.sentence_bleu([reference], candidate) -> Option B
      4. Quick Check:

        Use sentence_bleu with list of references [OK]
      Hint: Use sentence_bleu with references as a list [OK]
      Common Mistakes:
      • Passing candidate as first argument instead of second
      • Not wrapping reference in a list
      • Using wrong module or function name
      3. Given the following code snippet, what will be the printed BLEU score?
      from nltk.translate.bleu_score import sentence_bleu
      reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
      candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
      score = sentence_bleu(reference, candidate)
      print(round(score, 2))
      medium
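      A. 0.0
      B. 0.75
      C. 0.58
      D. 0.33

      Solution

      1. Step 1: Understand BLEU calculation basics

        By default, sentence_bleu takes the geometric mean of the 1-gram to 4-gram precisions. The candidate differs by one word ('sat' vs 'is'), which gives precisions of 5/6, 3/5, 1/4, and 0/3.
      2. Step 2: Run or estimate BLEU score

        No 4-gram in the candidate appears in the reference, so the geometric mean collapses. Without a smoothing function, NLTK warns about the zero 4-gram overlap and returns a value that rounds to 0.0.
      3. Final Answer:

        0.0 -> Option A
      4. Quick Check:

        One zero n-gram precision drives unsmoothed BLEU to about 0 [OK]
      Hint: Short sentences often lack 4-gram matches; pass a smoothing function for them [OK]
      Common Mistakes:
      • Assuming one changed word only slightly lowers BLEU
      • Confusing BLEU with ROUGE
      • Forgetting that default BLEU includes the 4-gram precision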
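The n-gram precisions behind this question's snippet can be tallied by hand. The sketch below is a simplified single-reference calculation, not NLTK's implementation. The helper names `clipped_matches` and `geometric_mean_bleu` are hypothetical, and the brevity penalty is left out because both sentences have six tokens, which makes it 1.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_matches(reference, candidate, n):
    """Return (matched, total) candidate n-grams against one reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    matched = sum((cand & ref).values())  # intersection clips to the reference count
    return matched, sum(cand.values())

def geometric_mean_bleu(stats, max_n):
    """Unsmoothed geometric mean of the first max_n precisions; 0.0 if any is zero."""
    precisions = [matched / total for matched, total in stats[:max_n]]
    if min(precisions) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# (matched, total) for 1-grams through 4-grams
stats = [clipped_matches(reference, candidate, n) for n in range(1, 5)]
bleu4 = geometric_mean_bleu(stats, 4)
bleu2 = geometric_mean_bleu(stats, 2)
print(stats, bleu4, round(bleu2, 2))
```

Dropping to 1- and 2-grams only gives a non-zero score, which shows how strongly the higher-order precisions affect short sentences.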
      4. You wrote code to compute ROUGE-L score but get an error: AttributeError: module 'rouge' has no attribute 'Rouge'. What is the likely cause?
      medium
      A. The input texts are empty strings
      B. ROUGE-L score cannot be computed in Python
      C. The 'rouge' package is not installed or imported incorrectly
      D. You must use BLEU instead of ROUGE-L

      Solution

      1. Step 1: Analyze the error message

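        The error says a module named 'rouge' was found but has no attribute 'Rouge'. This typically happens when a different package or a local file named rouge.py is imported instead of the package that provides the Rouge class.
      2. Step 2: Understand correct usage

        Install the package that provides the Rouge class, make sure no local file shadows it, and import the class with from rouge import Rouge.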
      3. Final Answer:

        The 'rouge' package is not installed or imported incorrectly -> Option C
      4. Quick Check:

        AttributeError here means the imported module does not define Rouge, so check which module was actually loaded [OK]
      Hint: Check package installation and import statements first [OK]
      Common Mistakes:
      • Assuming ROUGE-L can't be computed in Python
      • Ignoring installation errors
      • Using wrong package names
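When a ROUGE package is unavailable, the idea behind ROUGE-L can still be sketched without dependencies. The code below is an illustrative approximation: it assumes whitespace tokenization and an equal-weight F1, and the names `lcs_length` and `rouge_l` are hypothetical. Published packages may tokenize, stem, or weight recall differently, so their numbers can differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal tokens, otherwise carry the best so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, candidate):
    """Return (precision, recall, f1) based on the longest common subsequence."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
precision, recall, f1 = rouge_l(reference, candidate)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

The longest common subsequence here is "the cat on the mat", five of six tokens, even though the matched words are not all adjacent.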
      5. You have two text generation models. Model A has a BLEU score of 0.45 and ROUGE-L score of 0.60. Model B has a BLEU score of 0.55 and ROUGE-L score of 0.50. Which model should you prefer if you want better phrase matching and why?
      hard
      A. Model A, because lower BLEU means better phrase matching
      B. Model A, because higher ROUGE-L means better phrase matching
      C. Model B, because lower ROUGE-L means better phrase matching
      D. Model B, because higher BLEU means better phrase matching

      Solution

      1. Step 1: Understand BLEU and ROUGE focus

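        BLEU rewards contiguous n-gram matches (phrases of up to four words by default); ROUGE-L is based on the longest common subsequence, which keeps word order but allows gaps between matched words.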
      2. Step 2: Compare scores for phrase matching

        Model B has higher BLEU (0.55) than Model A (0.45), so Model B is better for phrase matching.
      3. Final Answer:

        Model B, because higher BLEU means better phrase matching -> Option D
      4. Quick Check:

        Higher BLEU = better phrase matching [OK]
      Hint: BLEU = contiguous phrase match; ROUGE-L = in-order word overlap with gaps allowed [OK]
      Common Mistakes:
      • Confusing BLEU and ROUGE meanings
      • Choosing model with higher ROUGE for phrase matching
      • Ignoring which metric matches the goal