
Evaluating generated text (BLEU, ROUGE) in NLP - Model Pipeline Trace

Model Pipeline - Evaluating generated text (BLEU, ROUGE)

This pipeline shows how to check the quality of machine-generated text. Each generated sentence is compared with a human-written reference sentence, and the amount of overlap is summarized by two scores, BLEU and ROUGE.

Data Flow - 5 Stages
1. Input Texts
    Input: 1000 pairs of sentences
    Operation: Collect pairs of generated text and reference text
    Output: 1000 pairs of sentences
    Example: Generated: 'The cat sat on the mat.' Reference: 'A cat is sitting on the mat.'
2. Tokenization
    Input: 1000 pairs of sentences
    Operation: Split sentences into words (tokens)
    Output: 1000 pairs of token lists
    Example: ['The', 'cat', 'sat', 'on', 'the', 'mat'] and ['A', 'cat', 'is', 'sitting', 'on', 'the', 'mat']
3. Calculate BLEU Score
    Input: 1000 pairs of token lists
    Operation: Compare n-grams between generated and reference texts to get BLEU score
    Output: 1000 BLEU scores (0 to 1)
    Example: BLEU score: 0.65
4. Calculate ROUGE Score
    Input: 1000 pairs of token lists
    Operation: Compare overlapping units like n-grams and longest common subsequence for ROUGE scores
    Output: 1000 ROUGE scores (Recall, Precision, F1)
    Example: ROUGE-L F1 score: 0.70
5. Aggregate Scores
    Input: 1000 BLEU and ROUGE scores
    Operation: Calculate average scores to summarize model performance
    Output: Average BLEU and ROUGE scores
    Example: Average BLEU: 0.62, Average ROUGE-L F1: 0.68
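Stages 2, 4, and 5 can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline's actual code: the helper names (tokenize, lcs_length, rouge_l_f1) are hypothetical, the tokenizer is a simple regular expression that keeps case and drops punctuation, and the second sentence pair is made up so that there is something to average.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation (a simplification)."""
    return re.findall(r"\w+", sentence)

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)

# (generated, reference) pairs; the second pair is an assumed extra example.
pairs = [
    ("The cat sat on the mat.", "A cat is sitting on the mat."),
    ("The cat sat on the mat.", "The cat sat on the mat."),
]

scores = [rouge_l_f1(tokenize(gen), tokenize(ref)) for gen, ref in pairs]
average = sum(scores) / len(scores)

print([round(s, 3) for s in scores])  # [0.615, 1.0]
print(round(average, 3))              # 0.808
```

For the first pair the longest common subsequence is 'cat', 'on', 'the', 'mat' (length 4), giving precision 4/6, recall 4/7, and F1 of 8/13. Because this tokenizer keeps case, 'The' and 'the' count as different tokens; lowercasing first is a common alternative.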
Training Trace - Epoch by Epoch

Loss
1.0 |***************
0.8 |************
0.6 |********
0.4 |*****
0.2 |**
0.0 +----------------
     1 2 3 4 5 Epochs
Epoch | Loss ↓ | Accuracy ↑ | Observation
1 | 0.85 | 0.40 | Initial evaluation shows low BLEU and ROUGE scores indicating poor text quality.
2 | 0.65 | 0.55 | Scores improve as the model learns to generate more similar text.
3 | 0.50 | 0.68 | Better matching of n-grams and sequences reflected in higher BLEU and ROUGE.
4 | 0.40 | 0.75 | Model generates more fluent and relevant text, scores continue to rise.
5 | 0.35 | 0.80 | Training converges with good BLEU and ROUGE scores showing quality text generation.
Prediction Trace - 5 Layers
Layer 1: Input Generated Sentence
Layer 2: Input Reference Sentence
Layer 3: Calculate BLEU Score
Layer 4: Calculate ROUGE Score
Layer 5: Final Evaluation
Model Quiz - 3 Questions
Test your understanding
What does the BLEU score mainly measure in generated text?
A. The length of the generated text
B. Matching groups of words (n-grams) with the reference text
C. The number of unique words in the generated text
D. The grammatical correctness of the generated text
Key Insight
BLEU and ROUGE scores measure how closely generated text overlaps with a reference. Tracking them during training shows whether the model's output is moving closer to the reference texts; they measure word and phrase overlap, so they indicate accuracy and fluency only indirectly.

Practice

1. What is the main purpose of BLEU and ROUGE scores in evaluating generated text?
easy
A. To measure how similar the generated text is to human-written text
B. To check the spelling errors in generated text
C. To count the number of words in the generated text
D. To translate text from one language to another

Solution

  1. Step 1: Understand the role of BLEU and ROUGE

    Both BLEU and ROUGE are metrics used to compare generated text with reference human text to check similarity.
  2. Step 2: Identify the main purpose

    They do not check spelling, count words, or translate text but measure similarity to human text.
  3. Final Answer:

    To measure how similar the generated text is to human-written text -> Option A
  4. Quick Check:

    BLEU and ROUGE measure similarity [OK]
Hint: Remember: BLEU and ROUGE check similarity, not spelling or translation [OK]
Common Mistakes:
  • Confusing BLEU/ROUGE with spell check
  • Thinking they count words only
  • Assuming they translate text
2. Which of the following is the correct way to calculate BLEU score using Python's nltk library?
easy
A. bleu_score = nltk.bleu_score([candidate], reference)
B. bleu_score = nltk.translate.bleu_score.sentence_bleu([reference], candidate)
C. bleu_score = nltk.translate.bleu_score([candidate], [reference])
D. bleu_score = nltk.score.bleu(candidate, reference)

Solution

  1. Step 1: Recall the nltk BLEU function syntax

    The correct function is sentence_bleu from nltk.translate.bleu_score, which takes a list of references and a candidate sentence.
  2. Step 2: Match the correct syntax

    Option B calls sentence_bleu([reference], candidate): the references come first, wrapped in a list, followed by the tokenized candidate. This is the correct call format.
  3. Final Answer:

    bleu_score = nltk.translate.bleu_score.sentence_bleu([reference], candidate) -> Option B
  4. Quick Check:

    Use sentence_bleu with list of references [OK]
Hint: Use sentence_bleu with references as a list [OK]
Common Mistakes:
  • Passing candidate as first argument instead of second
  • Not wrapping reference in a list
  • Using wrong module or function name
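The first argument is a list because BLEU allows several references per candidate: each candidate word is credited at most as many times as it appears in the reference where it is most frequent. The sketch below illustrates that clipping rule for unigrams in plain Python. It is not NLTK's implementation, and the function name and the example sentences are assumptions made for illustration.

```python
from collections import Counter

def clipped_unigram_precision(references, candidate):
    """Modified unigram precision against one or more references.

    Each candidate word counts at most as often as it occurs in the
    single reference that contains it most often.
    """
    if not candidate:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    matched = sum(
        min(count, max_ref_counts[word])
        for word, count in Counter(candidate).items()
    )
    return matched / len(candidate)

references = [
    ["the", "cat", "is", "on", "the", "mat"],
    ["there", "is", "a", "cat", "on", "the", "mat"],
]
candidate = ["the", "the", "the", "cat"]

# 'the' appears three times in the candidate but at most twice in any
# single reference, so only two of the three occurrences are credited.
print(clipped_unigram_precision(references, candidate))  # 0.75
```

With a single reference, the same function still expects a list containing that one token list, which mirrors why the reference is wrapped in a list in the call above.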
3. Given the following code snippet, what will be the printed BLEU score?
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
score = sentence_bleu(reference, candidate)
print(round(score, 2))
medium
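A. 0.0
B. 0.75
C. 0.58
D. 0.33

Solution

  1. Step 1: Understand BLEU calculation basics

    By default, sentence_bleu combines the clipped precisions of 1-grams through 4-grams with equal weights, as a geometric mean, and applies no smoothing.
  2. Step 2: Count the n-gram matches

    The candidate differs from the reference by one word ('sat' vs 'is'). The precisions are 5/6 for unigrams, 3/5 for bigrams, 1/4 for trigrams, and 0/3 for 4-grams, because all three candidate 4-grams contain 'sat'. A zero precision drives the geometric mean to effectively zero; recent NLTK releases emit a warning and return a vanishingly small number, which rounds to 0.0.
  3. Final Answer:

    0.0 -> Option A
  4. Quick Check:

    No matching 4-gram and no smoothing = BLEU of about 0 [OK]
Hint: One changed word in a six-word sentence breaks every 4-gram; for short sentences, consider passing a SmoothingFunction method [OK]
Common Mistakes:
  • Assuming one changed word always leaves BLEU close to 1
  • Forgetting that the default weights include 4-grams
  • Confusing BLEU with ROUGE
  • Ignoring n-gram overlap effect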
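The n-gram overlap for this sentence pair can be checked by hand. The sketch below is plain Python, not NLTK's implementation: it counts clipped n-gram matches for orders 1 to 4 and combines them with the textbook unsmoothed BLEU formula (geometric mean times brevity penalty). The helper names are illustrative assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Return (clipped matches, total candidate n-grams) for order n."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched, sum(cand.values())

def unsmoothed_bleu(precisions, cand_len, ref_len):
    """Geometric mean of the precisions times the brevity penalty."""
    if any(matched == 0 for matched, _ in precisions):
        return 0.0  # one empty order zeroes the geometric mean
    log_mean = sum(math.log(m / t) for m, t in precisions) / len(precisions)
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_mean)

reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

precisions = [modified_precision(reference, candidate, n) for n in range(1, 5)]
print(precisions)  # [(5, 6), (3, 5), (1, 4), (0, 3)]

# Orders 1-4: the missing 4-gram match gives 0.0.
print(unsmoothed_bleu(precisions, len(candidate), len(reference)))  # 0.0

# Orders 1-3 only: (5/6 * 3/5 * 1/4) ** (1/3) = 0.5
print(round(unsmoothed_bleu(precisions[:3], len(candidate), len(reference)), 2))  # 0.5
```

The comparison between the two final lines shows how much a sentence-level score depends on the highest n-gram order included, which is why smoothing or lower-order weights are often used for short sentences.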
4. You wrote code to compute ROUGE-L score but get an error: AttributeError: module 'rouge' has no attribute 'Rouge'. What is the likely cause?
medium
A. The input texts are empty strings
B. ROUGE-L score cannot be computed in Python
C. The 'rouge' package is not installed or imported incorrectly
D. You must use BLEU instead of ROUGE-L

Solution

  1. Step 1: Analyze the error message

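    The error says a module named 'rouge' was imported but has no attribute 'Rouge'. A package that is missing entirely would raise ModuleNotFoundError instead, so Python found some module called 'rouge', just not one that defines Rouge. Typical causes are a local file named rouge.py shadowing the package, or a different ROUGE package being installed than the one the code expects.
  2. Step 2: Understand correct usage

    Install the package that actually provides the Rouge class, make sure no local file or folder named 'rouge' shadows it, and import the class from it. Of the listed options, only C points at an installation or import problem.
  3. Final Answer:

    The 'rouge' package is not installed or imported incorrectly -> Option C
  4. Quick Check:

    AttributeError on a module means the module loaded but does not define that name [OK]
Hint: Check which 'rouge' module is actually being imported, then the install and import statements [OK]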
Common Mistakes:
  • Assuming ROUGE-L can't be computed in Python
  • Ignoring installation errors
  • Using wrong package names
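One way to diagnose this kind of error is to ask Python which file a module name resolves to and whether the expected attribute exists. The sketch below uses only the standard library; the helper name inspect_module is hypothetical, and the output for 'rouge' depends on what is installed in your environment.

```python
import importlib
import importlib.util

def inspect_module(module_name, attribute):
    """Report where module_name resolves from and whether it exposes attribute."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        # Nothing importable under this name: an import would raise
        # ModuleNotFoundError, not AttributeError.
        return {"found": False, "origin": None, "has_attribute": False}
    module = importlib.import_module(module_name)
    return {
        "found": True,
        "origin": spec.origin,  # path of the file that will be imported
        "has_attribute": hasattr(module, attribute),
    }

# If "origin" points at a file in your own project (for example ./rouge.py),
# that file is shadowing the installed package.
print(inspect_module("rouge", "Rouge"))
```

A result with found set to True and has_attribute set to False matches the AttributeError in the question: a module loaded, but it is not the one that defines the class.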
5. You have two text generation models. Model A has a BLEU score of 0.45 and ROUGE-L score of 0.60. Model B has a BLEU score of 0.55 and ROUGE-L score of 0.50. Which model should you prefer if you want better phrase matching and why?
hard
A. Model A, because lower BLEU means better phrase matching
B. Model A, because higher ROUGE-L means better phrase matching
C. Model B, because lower ROUGE-L means better phrase matching
D. Model B, because higher BLEU means better phrase matching

Solution

  1. Step 1: Understand BLEU and ROUGE focus

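    BLEU is built on n-gram precision, so it rewards contiguous phrases that also appear in the reference. ROUGE-L is built on the longest common subsequence, so it rewards words that appear in the same order even when other words sit between them.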
  2. Step 2: Compare scores for phrase matching

    Model B has higher BLEU (0.55) than Model A (0.45), so Model B is better for phrase matching.
  3. Final Answer:

    Model B, because higher BLEU means better phrase matching -> Option D
  4. Quick Check:

    Higher BLEU = better phrase matching [OK]
Hint: BLEU = contiguous phrase match; ROUGE-L = in-order word overlap [OK]
Common Mistakes:
  • Confusing BLEU and ROUGE meanings
  • Choosing model with higher ROUGE for phrase matching
  • Ignoring which metric matches the goal
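The difference between the two metrics shows up clearly on contrived inputs. The sketch below compares clipped bigram precision (the phrase-matching ingredient of BLEU) with the longest common subsequence length (the basis of ROUGE-L). It is plain Python with hypothetical helper names and made-up sentences, not a full BLEU or ROUGE implementation.

```python
from collections import Counter

def bigram_precision(candidate, reference):
    """Share of candidate bigrams that also occur in the reference (clipped)."""
    cand = Counter(zip(candidate, candidate[1:]))
    ref = Counter(zip(reference, reference[1:]))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(count, ref[gram]) for gram, count in cand.items()) / total

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

reference = "the cat sat on the mat".split()

# Right words in the right order, but never two in a row.
scattered = "the big cat then sat down on top the old mat".split()
# Correct phrases, but in the wrong order.
phrases = "on the mat the cat sat".split()

print(bigram_precision(scattered, reference))                 # 0.0
print(lcs_length(scattered, reference) / len(reference))      # 1.0
print(bigram_precision(phrases, reference))                   # 0.8
print(lcs_length(phrases, reference) / len(reference))        # 0.5
```

The scattered candidate gets full LCS recall but no phrase credit, while the reordered candidate gets high phrase credit but only half the LCS recall, which is the distinction the question relies on.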