BLEU score measures how close a machine-generated text is to human-written text. It checks if the words and phrases match well. This helps us know if a translation or text generation is good. BLEU focuses on matching small groups of words (called n-grams) between the output and reference. The higher the BLEU score (from 0 to 1), the better the match.
BLEU score evaluation in NLP - Model Metrics & Evaluation
BLEU does not use a confusion matrix like classification. Instead, it counts matching n-grams between the candidate and reference texts.
Reference: "the cat is on the mat"
Candidate: "the cat sat on the mat"
Unigram matches: the, cat, on, the, mat (5 matches)
Bigram matches: the cat, on the, the mat (3 matches)
BLEU score combines these matches with a penalty for short sentences.
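The counts above can be reproduced with a few lines of Python. This is only a minimal sketch using the standard library, not how any particular toolkit implements it; the helper names `ngrams` and `matching_ngrams` are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def matching_ngrams(candidate, reference, n):
    """Count candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference. Returns (matches, total candidate n-grams).
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matches, sum(cand_counts.values())

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

unigram = matching_ngrams(candidate, reference, 1)
bigram = matching_ngrams(candidate, reference, 2)
print(unigram)  # (5, 6): 5 of the 6 candidate words match
print(bigram)   # (3, 5): 3 of the 5 candidate bigrams match
```

The second number in each pair is the denominator of the n-gram precision, so this example has unigram precision 5/6 and bigram precision 3/5.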
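BLEU mainly measures precision: what fraction of the candidate's n-grams appear in the reference. It does not measure recall (how many reference n-grams appear in the candidate). To stop a candidate from scoring well just by repeating a common word, BLEU uses clipped precision: each n-gram is credited at most as many times as it occurs in the reference. Even so, a candidate can reach high precision while leaving out part of the meaning.
Examples (against the reference "the cat is on the mat"):
- Candidate: "the the the the the" (unclipped precision would be 5/5, but "the" occurs only twice in the reference, so clipped precision is 2/5)
- Candidate: "cat is on mat" (every word matches, so unigram precision is 4/4, yet two reference words are missing)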
BLEU uses a brevity penalty to avoid very short outputs scoring too high.
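The two example candidates can be checked numerically. The sketch below assumes the reference "the cat is on the mat" from the earlier example and the standard brevity penalty, exp(1 - r/c) when the candidate length c does not exceed the reference length r. The function names are illustrative, not taken from any library.

```python
import math
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Unigram precision where each word counts at most as often as in the reference."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return clipped / len(candidate)

def brevity_penalty(candidate_len, reference_len):
    """Return 1.0 for candidates longer than the reference, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

reference = "the cat is on the mat".split()
repeated = "the the the the the".split()
short = "cat is on mat".split()

p_repeated = clipped_unigram_precision(repeated, reference)
p_short = clipped_unigram_precision(short, reference)
bp_repeated = brevity_penalty(len(repeated), len(reference))
bp_short = brevity_penalty(len(short), len(reference))

print(p_repeated, round(bp_repeated, 3))  # 0.4 0.819
print(p_short, round(bp_short, 3))        # 1.0 0.607
```

Clipping keeps the repeated "the" candidate at 2/5, and the brevity penalty is what pulls the short candidate down from a perfect precision.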
BLEU scores range from 0 to 1 (often shown as 0 to 100%).
- Good BLEU: Above 0.5 (50%) usually means the output is quite close to human text.
- Moderate BLEU: Around 0.3 to 0.5 means some matching but room to improve.
- Bad BLEU: Below 0.2 means poor match, likely bad translation or text.
Note: BLEU is best used to compare models, not as an absolute quality measure.
- BLEU ignores meaning and grammar; it only checks word overlap.
- High BLEU does not always mean good quality text.
- Longer n-gram matches are much harder to get than shorter ones; on a single short sentence, having no 4-gram match can drive unsmoothed BLEU to zero.
- Using only one reference text can limit BLEU's reliability.
- BLEU does not measure recall, so missing important words is not penalized enough.
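To make the n-gram limitation concrete, here is a rough single-reference BLEU sketch with uniform weights and no smoothing. It is an illustration under those assumptions, not a replacement for a tested library; `modified_precision` and `simple_bleu` are names chosen for this example.

```python
import math
from collections import Counter

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def simple_bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of the 1..max_n precisions."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined, so one empty n-gram order zeroes the score
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

bleu2 = simple_bleu(candidate, reference, max_n=2)
bleu4 = simple_bleu(candidate, reference, max_n=4)
print(round(bleu2, 4))  # 0.7071
print(bleu4)            # 0.0, because no 4-gram matches
```

A candidate that differs from the reference by one word scores about 0.71 with up to bigrams but 0 with up to 4-grams, which is why sentence-level BLEU is usually smoothed or reported over a whole corpus.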
Your machine translation model has a BLEU score of 0.65. Is this good? Why or why not?
Answer: A BLEU score of 0.65 is generally good, showing strong overlap with human translations. However, you should also check the actual text quality because BLEU does not capture meaning or grammar perfectly.
Practice
Solution
Step 1: Understand BLEU score purpose
BLEU score is designed to compare machine translations to human reference translations.
Step 2: Identify what BLEU measures
It measures similarity in words and phrases, not speed or grammar correctness.
Final Answer:
How close the machine translation is to human translations -> Option A
Quick Check:
BLEU = similarity to human translations [OK]
- Confusing BLEU with translation speed
- Thinking BLEU measures grammar correctness
- Assuming BLEU counts total words only
Solution
Step 1: Recall NLTK BLEU function syntax
The correct function is sentence_bleu and it takes a list of references and a candidate sentence.
Step 2: Match correct argument order
References must be a list of lists, candidate is a list of tokens.
Final Answer:
bleu_score = nltk.translate.bleu_score.sentence_bleu([reference], candidate) -> Option B
Quick Check:
Use sentence_bleu([ref], cand) syntax [OK]
- Passing candidate before reference
- Not wrapping reference in a list
- Using incorrect function names
Given the candidate sentence ["the", "cat", "is", "on", "the", "mat"] and reference sentence ["there", "is", "a", "cat", "on", "the", "mat"], what is the approximate BLEU score (unigram precision only)?
Solution
Step 1: Count unigram matches with clipping
Candidate words: the, cat, is, on, the, mat
Reference words: there, is, a, cat, on, the, mat
'the' appears twice in the candidate but only once in the reference, so only one 'the' counts.
Clipped matches: 'the' once, 'cat' once, 'is' once, 'on' once, 'mat' once = 5 matches
Step 2: Compute unigram precision
Matches = 5, candidate length = 6
Precision = 5/6 ≈ 0.83
Final Answer:
0.83 -> Option A
Quick Check:
Unigram precision = 5/6 = 0.83 [OK]
- Counting repeated words more than reference max
- Confusing unigram with bigram precision
- Ignoring max count clipping
from nltk.translate.bleu_score import sentence_bleu
reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
score = sentence_bleu(reference, candidate)
print(score)
Solution
Step 1: Check sentence_bleu input format
sentence_bleu expects references as a list of reference sentences (each a list of tokens), so reference must be wrapped in another list.
Step 2: Identify the error in code
Reference is given as a single flat list, not a list of lists, so each token string is treated as a separate reference. This typically produces a wrong score (often 0) rather than an obvious error.
Final Answer:
Reference should be a list of lists, not a single list -> Option C
Quick Check:
References = list of lists [OK]
- Passing reference as a flat list
- Passing candidate as string instead of list
- Ignoring input format requirements
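To see why the flat list goes wrong, look at what a scorer finds when it loops over the references argument. The sketch below makes no NLTK calls. It assumes, as the solution describes, that each element of the references argument is treated as one tokenized sentence. The helper name `tokens_seen_by_scorer` is hypothetical.

```python
def tokens_seen_by_scorer(references):
    """Mimic a scorer that treats each element of `references` as one sentence."""
    return [list(ref) for ref in references]

reference = ['the', 'cat', 'is', 'on', 'the', 'mat']

flat = tokens_seen_by_scorer(reference)       # reference passed as-is
wrapped = tokens_seen_by_scorer([reference])  # reference wrapped in a list

print(len(flat), flat[0])  # 6 ['t', 'h', 'e']
print(len(wrapped), wrapped[0])  # 1 ['the', 'cat', 'is', 'on', 'the', 'mat']
```

With the flat list, the scorer sees six "references" made of single characters, none of which can match a whole-word token in the candidate.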
You have two reference translations:
ref1 = ['the', 'cat', 'is', 'on', 'the', 'mat']
ref2 = ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']
And a candidate translation:
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
How should you prepare the references to correctly compute the BLEU score considering multiple references?
Solution
Step 1: Understand multiple references in BLEU
BLEU supports multiple references by passing a list of reference sentences (each a list of tokens).
Step 2: Prepare references correctly
References should be passed as [ref1, ref2], a list containing both reference lists.
Step 3: Avoid incorrect methods
Concatenating references or passing separately will give wrong results.
Final Answer:
Pass references as a list containing both ref1 and ref2 lists -> Option D
Quick Check:
Multiple references = list of reference lists [OK]
- Concatenating references into one list
- Passing references separately in multiple calls
- Using only one reference when multiple exist
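With several references, BLEU clips each candidate n-gram at the largest count found in any single reference. The sketch below shows this for unigrams only, using the ref1, ref2, and candidate from the question. It is a simplified illustration with a made-up function name, not library code.

```python
from collections import Counter

def multi_ref_unigram_precision(candidate, references):
    """Clipped unigram precision against a list of tokenized references."""
    cand_counts = Counter(candidate)
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            # Keep the highest count seen for this word in any one reference.
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
    return clipped / len(candidate)

ref1 = ['the', 'cat', 'is', 'on', 'the', 'mat']
ref2 = ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

both = multi_ref_unigram_precision(candidate, [ref1, ref2])
only_ref2 = multi_ref_unigram_precision(candidate, [ref2])
print(round(both, 2))       # 0.83
print(round(only_ref2, 2))  # 0.67
```

Adding ref1 lets both occurrences of "the" count, which is why dropping an available reference can lower the score.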
