The BLEU score estimates how good a machine translation is by measuring how much it overlaps with one or more human reference translations.
BLEU score evaluation in NLP
Introduction
Syntax
NLP
from nltk.translate.bleu_score import sentence_bleu

# References: a list of token lists (here, a single reference)
reference = [['this', 'is', 'a', 'test']]
# Candidate: one token list
candidate = ['this', 'is', 'a', 'test']

score = sentence_bleu(reference, candidate)
print(score)
The reference is a list of correct translations (each is a list of words).
The candidate is the machine's translation (a list of words).
Examples
NLP
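from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
# The candidate differs from the reference in one word ('sat' vs 'is')
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

score = sentence_bleu(reference, candidate)
print(score)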
NLP
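from nltk.translate.bleu_score import sentence_bleu

# Two acceptable reference translations for the same sentence
references = [['this', 'is', 'a', 'test'], ['this', 'is', 'test']]
candidate = ['this', 'is', 'a', 'test']

score = sentence_bleu(references, candidate)
print(score)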
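By default, sentence_bleu averages the 1-gram to 4-gram precisions with equal weights. In the first example above, no 4-gram of the candidate appears in the reference, so recent NLTK versions warn and return a score close to 0. The sketch below, which assumes NLTK is installed, shows two common ways around this: limiting the n-gram orders with weights, and smoothing zero counts with SmoothingFunction. The variable names are only illustrative.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Unigram-only BLEU: a single weight means only 1-grams are scored.
bleu_1 = sentence_bleu(reference, candidate, weights=(1,))

# BLEU over 1-grams and 2-grams, equally weighted.
bleu_2 = sentence_bleu(reference, candidate, weights=(0.5, 0.5))

# Default 4-gram BLEU, with smoothing so that a zero 4-gram count
# does not collapse the whole score.
smooth = SmoothingFunction().method1
bleu_4_smoothed = sentence_bleu(reference, candidate, smoothing_function=smooth)

print(f"BLEU-1: {bleu_1:.4f}")
print(f"BLEU-2: {bleu_2:.4f}")
print(f"BLEU-4 (smoothed): {bleu_4_smoothed:.4f}")
```

For this pair the unigram precision is 5/6 and the bigram precision is 3/5. Which setting is appropriate depends on the task, so scores are only comparable when they use the same weights and smoothing.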
Sample Model
This program calculates the BLEU score between one human reference and one machine candidate sentence. It shows how close the machine's sentence is to the human's.
NLP
from nltk.translate.bleu_score import sentence_bleu

# One reference translation
reference = [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']]

# Candidate translation from machine
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

# Calculate BLEU score
score = sentence_bleu(reference, candidate)
print(f"BLEU score: {score:.4f}")
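To see where the number comes from, the score can be reproduced without NLTK. The sketch below is a simplified single-reference BLEU written for illustration: the helper names ngram_counts and simple_bleu are made up here, it uses uniform weights and no smoothing, and it is not NLTK's implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=4):
    """Single-reference BLEU with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if matched == 0 or total == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: only applies when the candidate is shorter.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


reference = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
score = simple_bleu(reference, candidate)
print(f"BLEU score: {score:.4f}")
```

For this sentence pair the clipped precisions are 8/9, 6/8, 4/7 and 2/6 for n = 1 to 4, and the brevity penalty is 1 because both sentences have nine tokens. Their geometric mean is about 0.597.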
Important Notes
BLEU score ranges from 0 to 1, where 1 means perfect match.
BLEU uses matching of small word groups (called n-grams) to compare sentences.
Candidates shorter than the reference are penalized by BLEU's brevity penalty, so they may get lower scores even if every word is correct.
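The length effect in the last note comes from BLEU's brevity penalty. As a rough sketch (brevity_penalty_sketch is a local helper written for this illustration, not an NLTK import), the standard formula is 1 when the candidate is at least as long as the reference, and exp(1 - r/c) otherwise, where r is the reference length and c is the candidate length:

```python
import math


def brevity_penalty_sketch(candidate_len, reference_len):
    """Return BLEU's brevity penalty for the given lengths."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
# Every word is correct, but the last two words are missing.
short_candidate = ['the', 'cat', 'is', 'on']

bp = brevity_penalty_sketch(len(short_candidate), len(reference))
print(f"Brevity penalty: {bp:.4f}")
```

The n-gram precisions are multiplied by this factor, so a candidate that matches perfectly but stops early still scores below 1.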
Summary
BLEU score measures how close a machine translation is to human translations.
It compares words and word groups between candidate and reference sentences.
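Higher BLEU means more overlap with the references, which usually indicates better translation quality; it does not directly measure meaning or grammar.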
Practice
1. What does the BLEU score primarily measure in machine translation?
easy
Solution
Step 1: Understand BLEU score purpose
BLEU score is designed to compare machine translations to human reference translations.
Step 2: Identify what BLEU measures
It measures similarity in words and phrases, not speed or grammar correctness.
Final Answer:
How close the machine translation is to human translations -> Option A
Quick Check:
BLEU = similarity to human translations [OK]
Hint: BLEU = closeness to human translation quality [OK]
Common Mistakes:
- Confusing BLEU with translation speed
- Thinking BLEU measures grammar correctness
- Assuming BLEU counts total words only
2. Which of the following is the correct way to calculate the BLEU score using NLTK in Python?
easy
Solution
Step 1: Recall NLTK BLEU function syntax
The correct function is sentence_bleu and it takes a list of references and a candidate sentence.
Step 2: Match correct argument order
References must be a list of lists, candidate is a list of tokens.
Final Answer:
bleu_score = nltk.translate.bleu_score.sentence_bleu([reference], candidate) -> Option B
Quick Check:
Use sentence_bleu([ref], cand) syntax [OK]
Hint: Use sentence_bleu with references as list of lists [OK]
Common Mistakes:
- Passing candidate before reference
- Not wrapping reference in a list
- Using incorrect function names
3. Given the candidate sentence ["the", "cat", "is", "on", "the", "mat"] and reference sentence ["there", "is", "a", "cat", "on", "the", "mat"], what is the approximate BLEU score (unigram precision only)?
medium
Solution
Step 1: Calculate unigram matches
Candidate words: the, cat, is, on, the, mat
Reference words: there, is, a, cat, on, the, mat
Matching unigrams: the, cat, is, on, mat
Step 2: Clip repeated words
'the' appears twice in the candidate but only once in the reference, so only one 'the' counts.
Clipped matches: 'the' once, 'cat' once, 'is' once, 'on' once, 'mat' once = 5
Step 3: Compute unigram precision
Candidate length = 6, so precision = 5/6 ≈ 0.83
Final Answer:
0.83 -> Option A
Quick Check:
Unigram precision = 5/6 = 0.83 [OK]
Hint: Count max reference word matches for unigram precision [OK]
Common Mistakes:
- Counting repeated words more than reference max
- Confusing unigram with bigram precision
- Ignoring max count clipping
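The clipping in question 3 can be checked in a few lines. This is a minimal sketch using collections.Counter: intersecting two counters with & keeps the smaller count for each word, which matches the clipping rule.

```python
from collections import Counter

candidate = ["the", "cat", "is", "on", "the", "mat"]
reference = ["there", "is", "a", "cat", "on", "the", "mat"]

cand_counts = Counter(candidate)
ref_counts = Counter(reference)

# Unclipped: every candidate word that occurs anywhere in the reference counts.
unclipped = sum(1 for word in candidate if word in ref_counts)

# Clipped: `&` takes the minimum count per word, so 'the' counts only once.
clipped = sum((cand_counts & ref_counts).values())

print(f"Unclipped precision: {unclipped}/{len(candidate)}")
print(f"Clipped precision:   {clipped}/{len(candidate)} = {clipped / len(candidate):.2f}")
```

Without clipping, all six candidate words would count as matches; with clipping, only five do.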
4. Identify the error in this BLEU score calculation code snippet:
from nltk.translate.bleu_score import sentence_bleu

reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

score = sentence_bleu(reference, candidate)
print(score)
medium
Solution
Step 1: Check sentence_bleu input format
sentence_bleu expects references as a list of reference sentences (each a list of tokens), so reference must be wrapped in another list.
Step 2: Identify the error in code
Reference is given as a single flat list, not a list of lists, so each word is treated as a separate reference and the score comes out wrong.
Final Answer:
Reference should be a list of lists, not a single list -> Option C
Quick Check:
References = list of lists [OK]
Hint: Wrap reference in a list for sentence_bleu [OK]
Common Mistakes:
- Passing reference as a flat list
- Passing candidate as string instead of list
- Ignoring input format requirements
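Because a flat reference list does not necessarily raise an error, it can help to check the input shapes before scoring. The helper below is hypothetical (check_bleu_inputs is not part of NLTK) and only sketches one way to catch the mistakes listed above.

```python
def check_bleu_inputs(references, candidate):
    """Raise ValueError if the inputs are not shaped as token lists."""
    if isinstance(candidate, str):
        raise ValueError("candidate must be a list of tokens, not a string")
    if isinstance(references, str) or any(isinstance(ref, str) for ref in references):
        raise ValueError(
            "references must be a list of token lists; "
            "wrap a single reference as [reference]"
        )


reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

try:
    check_bleu_inputs(reference, candidate)  # flat list: rejected
    flat_ok = True
except ValueError:
    flat_ok = False

check_bleu_inputs([reference], candidate)  # list of lists: accepted
print("flat list accepted:", flat_ok)
```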
5. You have two reference translations:
ref1 = ['the', 'cat', 'is', 'on', 'the', 'mat']
ref2 = ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']
And a candidate translation:
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
How should you prepare the references to correctly compute the BLEU score considering multiple references?
hard
Solution
Step 1: Understand multiple references in BLEU
BLEU supports multiple references by passing a list of reference sentences (each a list of tokens).
Step 2: Prepare references correctly
References should be passed as [ref1, ref2], a list containing both reference lists.
Step 3: Avoid incorrect methods
Concatenating references or passing separately will give wrong results.
Final Answer:
Pass references as a list containing both ref1 and ref2 lists -> Option D
Quick Check:
Multiple references = list of reference lists [OK]
Hint: Use list of reference lists for multiple references [OK]
Common Mistakes:
- Concatenating references into one list
- Passing references separately in multiple calls
- Using only one reference when multiple exist
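As a sketch of the answer, assuming NLTK is installed, the call below passes both references in one list. The weights=(0.5, 0.5) argument is an addition for this illustration: it limits scoring to 1-grams and 2-grams so that these short sentences do not hit a zero 4-gram count. The comparison with a single reference is likewise only illustrative.

```python
from nltk.translate.bleu_score import sentence_bleu

ref1 = ['the', 'cat', 'is', 'on', 'the', 'mat']
ref2 = ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Correct: one list that contains every reference token list.
both = sentence_bleu([ref1, ref2], candidate, weights=(0.5, 0.5))

# For comparison: scoring against ref2 alone ignores matches found only in ref1.
only_ref2 = sentence_bleu([ref2], candidate, weights=(0.5, 0.5))

print(f"Both references: {both:.4f}")
print(f"ref2 only:       {only_ref2:.4f}")
```

With both references, each candidate n-gram is clipped by its highest count in any single reference, so the second 'the' and the bigram 'the cat' are credited through ref1.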
