To judge the quality of machine-generated text, we commonly use two automatic scores, BLEU and ROUGE. Both compare the generated text to one or more good example texts (called references). BLEU measures how many short word sequences (n-grams such as single words, pairs, or triples) in the generated text also appear in a reference, so it is precision-oriented. ROUGE measures how many of the reference's words or word sequences appear in the generated text, so it is recall-oriented (how much of the reference is covered). Use BLEU when the generated text should stay close to the reference wording, as in machine translation. Use ROUGE when the generated text must cover the important parts of the reference, as in summarization.
Evaluating generated text (BLEU, ROUGE) in NLP - Model Metrics & Evaluation
For text generation, we don't use a confusion matrix like in classification. Instead, we look at n-gram overlaps. Here is a simple example of how BLEU counts matching word groups:
Reference: "The cat sat on the mat"
Generated: "The cat is on the mat"
Unigrams (single words) match: The, cat, on, the, mat (5 matches)
Bigrams (pairs) match: "The cat", "on the", "the mat" (3 matches)
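BLEU turns these counts into precisions (here 5 of 6 unigrams and 3 of 5 bigrams), takes a weighted geometric mean over the n-gram sizes, and multiplies by a brevity penalty for outputs shorter than the reference. The result is a score between 0 and 1.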
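The counts in this example can be reproduced with a few lines of Python. This is a minimal sketch, not a full BLEU implementation: the helper names `ngrams` and `clipped_matches` are made up for illustration, and the sentences are lowercased and split on spaces, which is an assumption about tokenization.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_matches(reference, candidate, n):
    """Count candidate n-grams that also occur in the reference.

    Each n-gram is counted at most as often as it appears in the
    reference, so repeating a matching word does not inflate the count.
    """
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    return sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())

# Hypothetical tokenization: lowercase and split on whitespace.
reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

unigram_matches = clipped_matches(reference, candidate, 1)
bigram_matches = clipped_matches(reference, candidate, 2)

# Precision = matched n-grams / n-grams in the generated text.
unigram_precision = unigram_matches / len(ngrams(candidate, 1))
bigram_precision = bigram_matches / len(ngrams(candidate, 2))

print(unigram_matches, bigram_matches)  # 5 3
print(round(unigram_precision, 3), round(bigram_precision, 3))  # 0.833 0.6
```

Clipping matters once a word repeats: without it, a candidate such as "the the the" would count three matches against a reference that contains "the" only twice.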
ROUGE looks at recall of overlapping words or sequences, for example:
Reference summary: "The cat sat on the mat quietly."
Generated summary: "Cat sat quietly on mat."
ROUGE measures how many words or phrases from the reference appear in the generated text.
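For the two summaries above, unigram recall (the idea behind ROUGE-1) can be worked out by hand or with a short script. The sketch below assumes a simple tokenizer that lowercases and drops punctuation; `tokenize` and `rouge1` are illustrative names, not functions from any ROUGE library.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only word-like tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def rouge1(reference, generated):
    """Return (overlap, recall, precision) over unigrams."""
    ref_counts = Counter(tokenize(reference))
    gen_counts = Counter(tokenize(generated))
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & gen_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(gen_counts.values()), 1)
    return overlap, recall, precision

overlap, recall, precision = rouge1(
    "The cat sat on the mat quietly.",
    "Cat sat quietly on mat.",
)
print(overlap, round(recall, 3), round(precision, 3))  # 5 0.714 1.0
```

Five of the seven reference words are covered, so recall is 5/7. Every generated word appears in the reference, so precision is 1.0 even though both occurrences of "the" are missing.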
BLEU focuses more on precision: it checks how much of the generated text matches the reference exactly. If the generated text has many extra or wrong words, BLEU score goes down.
ROUGE focuses more on recall: it checks how much of the reference text is covered by the generated text. If the generated text misses important parts, ROUGE score goes down.
Example:
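- If a summary includes only a few words but all are correct, its n-gram precision is high but its recall is low, so ROUGE drops. BLEU's brevity penalty also lowers the final BLEU score when the output is much shorter than the reference.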
- If a summary covers many important points but adds some extra words, ROUGE might be high but BLEU lower (lower precision).
Choosing which metric to focus on depends on what matters more: exactness (BLEU) or coverage (ROUGE).
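The trade-off can be made concrete with one reference and two hypothetical summaries, one very short and one padded with extra words. This is a sketch under simple assumptions (whitespace tokenization, unigrams only); `unigram_precision_recall` and `brevity_penalty` are illustrative helpers, and the penalty follows the standard BLEU definition of exp(1 - r/c) when the candidate length c does not exceed the reference length r.

```python
import math
from collections import Counter

def unigram_precision_recall(reference, candidate):
    """Return (precision, recall) over clipped unigram matches."""
    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    overlap = sum((ref_counts & cand_counts).values())
    return overlap / len(candidate), overlap / len(reference)

def brevity_penalty(reference, candidate):
    """BLEU brevity penalty: 1 for long enough outputs, below 1 for short ones."""
    c, r = len(candidate), len(reference)
    return 1.0 if c > r else math.exp(1 - r / c)

reference = "the cat sat on the mat quietly".split()
short = "the cat".split()  # few words, all correct
padded = "the cat sat on the mat quietly in the warm afternoon sun".split()

short_p, short_r = unigram_precision_recall(reference, short)
padded_p, padded_r = unigram_precision_recall(reference, padded)
short_bp = brevity_penalty(reference, short)

print(round(short_p, 3), round(short_r, 3))    # 1.0 0.286
print(round(padded_p, 3), round(padded_r, 3))  # 0.583 1.0
print(round(short_bp, 3))                      # 0.082
```

The short summary has perfect precision but covers only 2 of 7 reference words, and the brevity penalty would scale its BLEU score down sharply. The padded summary covers everything but pays for the five extra words in precision.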
Both scores range from 0 to 1 (some tools report them on a 0 to 100 scale). Scores closer to 1.0 mean the generated text is very similar to the reference.
Good example: BLEU = 0.7, ROUGE = 0.75 means the generated text matches well in both exact words and coverage.
Bad example: BLEU = 0.2, ROUGE = 0.3 means the generated text is quite different or missing important parts.
However, scores depend on the task. For creative writing, lower scores might be okay. For machine translation or summaries, higher scores are expected.
- Overfitting to references: Models might copy reference text exactly to get high scores but produce less natural text.
- Ignoring meaning: BLEU and ROUGE check word overlap, not if the meaning is correct or fluent.
- Short text bias: Very short generated texts can get high precision but miss important content.
- Multiple valid outputs: There can be many good ways to say the same thing, but BLEU/ROUGE only compare to given references.
- Data leakage: Using test references during training inflates scores unfairly.
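The "ignoring meaning" and "multiple valid outputs" pitfalls are easy to reproduce. In the sketch below, both candidate sentences are invented for illustration: one reverses the meaning of the reference while reusing its words, the other paraphrases it with different words. Only unigram precision is computed, with whitespace tokenization assumed.

```python
from collections import Counter

def unigram_precision(reference, candidate):
    """Fraction of candidate words that also appear in the reference (clipped)."""
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    return overlap / len(candidate)

reference = "the cat is on the mat".split()
negated = "the cat is not on the mat".split()       # opposite meaning
paraphrase = "a feline rests upon the rug".split()  # similar meaning, new words

negated_score = unigram_precision(reference, negated)
paraphrase_score = unigram_precision(reference, paraphrase)

print(round(negated_score, 3))     # 0.857
print(round(paraphrase_score, 3))  # 0.167
```

The sentence with the wrong meaning scores far higher than the acceptable paraphrase, which is why overlap scores are usually read alongside human judgment or additional references.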
Your text generation model has a BLEU score of 0.85 but a ROUGE score of 0.40. Is this good for a summary task? Why or why not?
Answer: This means the generated text matches the reference words very precisely (high BLEU) but covers only a small part of the reference (low ROUGE). For summaries, coverage is important, so this model might miss key points. It is not good enough for summary tasks because it lacks recall.
Practice
Solution
Step 1: Understand the role of BLEU and ROUGE
Both BLEU and ROUGE are metrics used to compare generated text with reference human text to check similarity.
Step 2: Identify the main purpose
They do not check spelling, count words, or translate text but measure similarity to human text.
Final Answer:
To measure how similar the generated text is to human-written text -> Option A
Quick Check:
BLEU and ROUGE measure similarity [OK]
- Confusing BLEU/ROUGE with spell check
- Thinking they count words only
- Assuming they translate text
Solution
Step 1: Recall the nltk BLEU function syntax
The correct function is sentence_bleu from nltk.translate.bleu_score. Its first argument is a list of references, each a list of tokens, and its second is the candidate as a list of tokens.
Step 2: Match the correct syntax
bleu_score = nltk.translate.bleu_score.sentence_bleu([reference], candidate) wraps the single reference in a list and passes the candidate second, which is the correct call format.
Final Answer:
bleu_score = nltk.translate.bleu_score.sentence_bleu([reference], candidate) -> Option B
Quick Check:
Use sentence_bleu with list of references [OK]
- Passing candidate as first argument instead of second
- Not wrapping reference in a list
- Using wrong module or function name
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
score = sentence_bleu(reference, candidate)
print(round(score, 2))
Solution
Step 1: Understand BLEU calculation basics
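BLEU compares n-gram overlap. The candidate differs from the reference by one word ('sat' vs 'is'), and that one word sits inside every 4-gram of this six-word sentence. The modified precisions are 5/6 for unigrams, 3/5 for bigrams, 1/4 for trigrams, and 0/3 for 4-grams.
Step 2: Run or estimate BLEU score
The two sentences have the same length, so the brevity penalty is 1 and the score cannot exceed the largest precision, 5/6 (about 0.83). sentence_bleu uses uniform weights over 1-grams to 4-grams by default, so with no smoothing the zero 4-gram precision drives the score toward 0. Recent NLTK versions emit a warning in this case, and the rounded value printed should be 0.0. The value 0.92 listed as Option A is higher than any n-gram precision here, so this code cannot produce it.
Final Answer:
About 0.0 with the default settings; the listed 0.92 (Option A) overstates the score
Quick Check:
One changed word in a six-word sentence removes every 4-gram match, so unsmoothed 4-gram BLEU collapses [OK]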
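The n-gram precisions for this reference and candidate can be checked by hand with the sketch below. It follows the textbook, unsmoothed BLEU definition (clipped precisions, uniform weights, brevity penalty) and is not NLTK's implementation; `modified_precision` and `simple_bleu` are illustrative names, and NLTK's handling of zero counts and smoothing may differ in detail.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return matched / total

def simple_bleu(reference, candidate, max_n=4):
    """Unsmoothed BLEU with uniform weights over 1..max_n grams."""
    precisions = [modified_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # a zero precision makes the geometric mean zero
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

precisions = [modified_precision(reference, candidate, n) for n in range(1, 5)]
bleu2 = simple_bleu(reference, candidate, max_n=2)
bleu4 = simple_bleu(reference, candidate, max_n=4)

print([round(p, 3) for p in precisions])  # [0.833, 0.6, 0.25, 0.0]
print(round(bleu2, 3), bleu4)             # 0.707 0.0
```

Restricting the score to unigrams and bigrams gives about 0.71. For short sentences like this one, a smoothing function or a lower maximum n-gram order is the usual way to avoid a collapsed 4-gram score.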
- Assuming exact match needed for high BLEU
- Confusing BLEU with ROUGE
- Ignoring n-gram overlap effect
AttributeError: module 'rouge' has no attribute 'Rouge'. What is the likely cause?
Solution
Step 1: Analyze the error message
The error says a module named 'rouge' was found but does not define 'Rouge'. This usually means Python imported something other than the intended package, for example a different library with the same name or a local file called rouge.py.
Step 2: Understand correct usage
Install the 'rouge' package that provides the Rouge class, make sure no local file shadows it, and import the class with from rouge import Rouge before computing ROUGE-L.
Final Answer:
The 'rouge' package is not installed or imported incorrectly -> Option C
Quick Check:
AttributeError here means the wrong module was imported or the import is written incorrectly [OK]
- Assuming ROUGE-L can't be computed in Python
- Ignoring installation errors
- Using wrong package names
Solution
Step 1: Understand BLEU and ROUGE focus
BLEU focuses on phrase matching through exact n-gram overlap. ROUGE-L is based on the longest common subsequence: words that appear in the same order in both texts but need not be adjacent.
Step 2: Compare scores for phrase matching
Model B has higher BLEU (0.55) than Model A (0.45), so Model B is better for phrase matching.
Final Answer:
Model B, because higher BLEU means better phrase matching -> Option D
Quick Check:
Higher BLEU = better phrase matching [OK]
- Confusing BLEU and ROUGE meanings
- Choosing model with higher ROUGE for phrase matching
- Ignoring which metric matches the goal
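To see how the longest common subsequence behind ROUGE-L differs from exact phrase matching, here is a small dynamic-programming sketch applied to the earlier cat summaries. The function names are illustrative, tokens are assumed to be lowercased with punctuation removed, and the F-score is a plain F1, whereas ROUGE implementations may weight recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    """Return (recall, precision, f1) based on the LCS length."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

reference = "the cat sat on the mat quietly".split()
candidate = "cat sat quietly on mat".split()

lcs = lcs_length(reference, candidate)
recall, precision, f1 = rouge_l(reference, candidate)

print(lcs)  # 4, from "cat sat on mat"
print(round(recall, 3), round(precision, 3), round(f1, 3))  # 0.571 0.8 0.667
```

The subsequence "cat sat on mat" skips over words in both texts, so it counts toward ROUGE-L even though the only bigram the two summaries share is "cat sat".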
