BLEU and ROUGE score a computer's text by how much of its wording overlaps with human-written reference text. They measure surface similarity to the reference, not whether the text is meaningful or correct.
Evaluating generated text (BLEU, ROUGE) in NLP
Introduction
Syntax
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

# BLEU score example: a list of tokenized references, then the tokenized candidate
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'test']
bleu_score = sentence_bleu(reference, candidate)

# ROUGE score example: score(target, prediction), so the reference text comes first
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score('this is a test', 'this is test')
BLEU measures n-gram precision: the share of the candidate's words and phrases that also appear in the reference texts.
ROUGE measures overlapping words and sequences and reports precision, recall, and F-measure; it is often used for summaries.
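To make "matching words and phrases" concrete, here is a minimal pure-Python sketch of the clipped n-gram precision that BLEU is built on. The helper names `ngrams` and `clipped_precision` are illustrative assumptions, not part of NLTK, and the sketch handles a single reference only.

```python
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, candidate, n):
    """Share of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as often as it occurs in the
    reference, so repeating a correct word does not inflate the score.
    (Hypothetical helper for illustration; single reference only.)
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return Fraction(matched, total) if total else Fraction(0)


reference = ['this', 'is', 'a', 'test']
candidate = ['this', 'is', 'test']

print(clipped_precision(reference, candidate, 1))  # 1   (all three candidate words occur in the reference)
print(clipped_precision(reference, candidate, 2))  # 1/2 (only ('this', 'is') matches)
```

Clipping matters when a candidate repeats a word: `['the', 'the', 'the']` against `['the', 'cat']` gets 1/3, not 3/3.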
Examples
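from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
# No 4-gram of the candidate appears in the reference, so with the default
# weights and no smoothing, recent NLTK versions warn and return a score near 0.
bleu = sentence_bleu(reference, candidate)
print(f'BLEU score: {bleu:.2f}')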
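from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
# score(target, prediction): the reference comes first, the candidate second
scores = scorer.score('the cat is on the mat', 'the cat sat on the mat')
print(scores)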
Sample Model
This program compares a candidate sentence to a reference using BLEU and ROUGE. Because both texts are identical, BLEU is 1.00 and every ROUGE precision, recall, and F-measure is 1.0.
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

reference = [['the', 'quick', 'brown', 'fox']]
candidate = ['the', 'quick', 'brown', 'fox']
bleu = sentence_bleu(reference, candidate)

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score('the quick brown fox', 'the quick brown fox')

print(f'BLEU score: {bleu:.2f}')
print('ROUGE scores:', scores)
Important Notes
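BLEU scores range from 0 to 1, where 1 means the candidate exactly matches a reference. On short sentences, unsmoothed BLEU drops to nearly 0 whenever any n-gram order (up to 4 by default) has no match.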
ROUGE gives multiple scores; F-measure balances precision and recall.
Both metrics work best with multiple reference texts for fair comparison.
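The note on F-measure can be checked by hand. The sketch below computes ROUGE-1 style precision, recall, and F1 from plain unigram counts. It is an illustration under stated assumptions, not the rouge_score implementation: `rouge1` is a hypothetical helper, tokens are split on whitespace, and no stemming is applied.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Unigram-overlap precision, recall, and F1 for two strings.

    Hypothetical helper: lowercases and splits on whitespace, no stemming.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped shared-word count
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


precision, recall, f1 = rouge1('this is a test', 'this is test')
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
# precision=1.00 recall=0.75 f1=0.86
```

Every candidate word is in the reference (precision 1.00), but the candidate misses 'a' (recall 0.75). The F-measure sits between the two.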
Summary
BLEU and ROUGE help measure how close generated text is to human text.
BLEU focuses on matching phrases (n-gram precision); ROUGE focuses on overlapping words and sequences and also reports recall.
Use these scores to improve and compare text generation models.
Practice
1. What is the main purpose of BLEU and ROUGE scores in evaluating generated text?
easy
Solution
Step 1: Understand the role of BLEU and ROUGE
Both BLEU and ROUGE are metrics used to compare generated text with reference human text to check similarity.
Step 2: Identify the main purpose
They do not check spelling, count words, or translate text but measure similarity to human text.
Final Answer:
To measure how similar the generated text is to human-written text -> Option A
Quick Check:
BLEU and ROUGE measure similarity [OK]
Hint: Remember: BLEU and ROUGE check similarity, not spelling or translation [OK]
Common Mistakes:
- Confusing BLEU/ROUGE with spell check
- Thinking they count words only
- Assuming they translate text
2. Which of the following is the correct way to calculate BLEU score using Python's nltk library?
easy
Solution
Step 1: Recall the nltk BLEU function syntax
The correct function is sentence_bleu from nltk.translate.bleu_score, which takes a list of references and a candidate sentence.
Step 2: Match the correct syntax
The call nltk.translate.bleu_score.sentence_bleu([reference], candidate) passes the list of references first and the candidate second, which is the correct call format.
Final Answer:
bleu_score = nltk.translate.bleu_score.sentence_bleu([reference], candidate) -> Option B
Quick Check:
Use sentence_bleu with list of references [OK]
Hint: Use sentence_bleu with references as a list [OK]
Common Mistakes:
- Passing candidate as first argument instead of second
- Not wrapping reference in a list
- Using wrong module or function name
3. Given the following code snippet, what will be the printed BLEU score?
from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
score = sentence_bleu(reference, candidate)
print(round(score, 2))
medium
Solution
Step 1: Understand BLEU calculation basics
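BLEU compares clipped n-gram overlap. The candidate differs from the reference by one word ('sat' vs 'is'), which gives precisions of 5/6 for unigrams, 3/5 for bigrams, 1/4 for trigrams, and 0/3 for 4-grams.
Step 2: Run or estimate BLEU score
sentence_bleu defaults to uniform weights over 1- to 4-grams with no smoothing, so the zero 4-gram precision drives the geometric mean to effectively 0. Recent NLTK versions emit a warning and the code prints 0.0, not a value near 1.
Final Answer:
0.0 in recent NLTK versions. The 0.92 listed as Option A is not what this code prints.
Quick Check:
One n-gram order with no matches zeroes unsmoothed sentence-level BLEU [OK]
Hint: Check every n-gram order up to 4; short sentences often have no matching 4-gram [OK]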
Common Mistakes:
- Assuming exact match needed for high BLEU
- Confusing BLEU with ROUGE
- Ignoring n-gram overlap effect
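The n-gram counts for this question can be verified without NLTK. The following is a minimal sketch of single-reference BLEU with uniform weights and no smoothing. The function names are assumptions for illustration, and the zero handling is simplified: it returns exactly 0.0, where recent NLTK versions return a tiny positive number.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def precisions(reference, candidate, max_n=4):
    """Clipped (matched, total) n-gram counts for n = 1..max_n."""
    result = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        matched = sum((cand & ref).values())  # & keeps the minimum count per n-gram
        result.append((matched, sum(cand.values())))
    return result


def unsmoothed_bleu(reference, candidate, max_n=4):
    """Single-reference BLEU, uniform weights, no smoothing (illustrative only)."""
    pairs = precisions(reference, candidate, max_n)
    if any(matched == 0 for matched, _ in pairs):
        return 0.0  # one zero precision zeroes the geometric mean
    log_mean = sum(math.log(m / t) for m, t in pairs) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = 'the cat is on the mat'.split()
candidate = 'the cat sat on the mat'.split()

print(precisions(reference, candidate))  # [(5, 6), (3, 5), (1, 4), (0, 3)]
print(unsmoothed_bleu(reference, candidate))  # 0.0
print(round(unsmoothed_bleu(reference, candidate, max_n=2), 2))  # 0.71
```

Restricting the score to unigrams and bigrams (`max_n=2`) gives about 0.71, which shows how much the missing 4-gram dominates the default result. NLTK offers smoothing through its SmoothingFunction class for exactly this situation.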
4. You wrote code to compute ROUGE-L score but get an error:
AttributeError: module 'rouge' has no attribute 'Rouge'
What is the likely cause?
medium
Solution
Step 1: Analyze the error message
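The error says the module 'rouge' has no attribute 'Rouge'. Python did find a module named rouge (a missing package would raise ModuleNotFoundError instead), but that module does not define Rouge. Typical causes are a different package installed under that name or a local file named rouge.py shadowing it.
Step 2: Understand correct usage
Install the package that provides the Rouge class, make sure no local file shadows it, and import the class with from rouge import Rouge.
Final Answer:
The 'rouge' package is not installed or imported incorrectly -> Option C
Quick Check:
AttributeError on a module means the imported module lacks that name [OK]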
Hint: Check package installation and import statements first [OK]
Common Mistakes:
- Assuming ROUGE-L can't be computed in Python
- Ignoring installation errors
- Using wrong package names
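A quick way to tell a missing package from a wrong or shadowed one is to check where the module was loaded from. This is a small diagnostic sketch using only the standard library; `diagnose_import` is a hypothetical helper name, and its messages are illustrative.

```python
import importlib


def diagnose_import(module_name, attribute):
    """Report whether a module imports and whether it exposes an attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError
        return f'{module_name}: not importable (package missing from this environment)'
    location = getattr(module, '__file__', 'built-in')
    if hasattr(module, attribute):
        return f'{module_name}.{attribute}: found in {location}'
    return f'{module_name}: imported from {location}, but it has no {attribute}'


print(diagnose_import('rouge', 'Rouge'))
```

If the reported location is a file in your own project rather than site-packages, a local rouge.py is probably shadowing the installed package.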
5. You have two text generation models. Model A has a BLEU score of 0.45 and ROUGE-L score of 0.60. Model B has a BLEU score of 0.55 and ROUGE-L score of 0.50. Which model should you prefer if you want better phrase matching and why?
hard
Solution
Step 1: Understand BLEU and ROUGE focus
BLEU focuses on phrase matching; ROUGE-L focuses on the longest common subsequence (words shared in the same order, not necessarily adjacent).
Step 2: Compare scores for phrase matching
Model B has higher BLEU (0.55) than Model A (0.45), so Model B is better for phrase matching.
Final Answer:
Model B, because higher BLEU means better phrase matching -> Option D
Quick Check:
Higher BLEU = better phrase matching [OK]
Hint: BLEU = phrase match; ROUGE = word overlap [OK]
Common Mistakes:
- Confusing BLEU and ROUGE meanings
- Choosing model with higher ROUGE for phrase matching
- Ignoring which metric matches the goal
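To see what ROUGE-L rewards, the sketch below computes it from the longest common subsequence (LCS) of two sentences. It is a simplified illustration, not the rouge_score implementation: `lcs_length` and `rouge_l` are assumed names, tokens are split on whitespace, and no stemming is applied.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(reference, candidate):
    """LCS-based precision, recall, and F1 (hypothetical helper, no stemming)."""
    ref_tokens, cand_tokens = reference.split(), candidate.split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return precision, recall, 2 * precision * recall / (precision + recall)


precision, recall, f1 = rouge_l('the cat is on the mat', 'the cat sat on the mat')
print(f'{f1:.2f}')  # 0.83
```

The LCS here is 'the cat on the mat' (5 of 6 words), even though the match is interrupted in the middle. That tolerance for gaps is why a model can score well on ROUGE-L while scoring lower on BLEU's contiguous phrase matching.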
