What does the BLEU score primarily measure in machine translation?
Think about what BLEU compares between the candidate and reference sentences.
BLEU measures n-gram overlap: how many n-grams in the candidate translation also appear in the reference translations, combined with a brevity penalty for candidates that are too short. It captures surface overlap rather than grammar or meaning.
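The overlap counting behind BLEU can be sketched in a few lines. This is a minimal illustration of the clipped (modified) n-gram precision that BLEU is built on; the helper names here are ours, not NLTK's.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-token windows of the sentence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # Clip each candidate n-gram count by its maximum count in any reference,
    # so repeating a reference word cannot inflate the score.
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for ng, c in Counter(ngrams(ref, n)).items():
            max_ref[ng] = max(max_ref[ng], c)
    clipped = sum(min(c, max_ref[ng]) for ng, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

references = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(modified_precision(candidate, references, 1))  # 5 of 6 unigrams overlap
```

The candidate gets credit for 'the' (twice), 'cat', 'on', and 'mat', but not 'sat', giving a unigram precision of 5/6. Full BLEU combines these precisions for n = 1..4 and applies the brevity penalty.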
What is the BLEU score output of the following Python code?
from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'is', 'on', 'the', 'mat']
score = sentence_bleu(reference, candidate)
print(round(score, 2))
Consider what happens when the candidate exactly matches the reference.
When the candidate sentence exactly matches the reference, the BLEU score is 1.0: every n-gram overlaps and the brevity penalty does not apply.
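For contrast, here is a sketch of how one wrong word lowers the score, using NLTK's sentence_bleu. The custom weights restrict scoring to unigrams and bigrams; that choice is ours, made so the short sentence isn't dominated by missing 4-grams.

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
perfect = ['the', 'cat', 'is', 'on', 'the', 'mat']
partial = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Exact match: every n-gram overlaps, so the score is 1.0.
print(round(sentence_bleu(reference, perfect), 2))  # 1.0

# One wrong word: unigram precision 5/6, bigram precision 3/5.
score = sentence_bleu(reference, partial, weights=(0.5, 0.5))
print(round(score, 2))  # 0.71, the geometric mean sqrt(5/6 * 3/5)
```

A single substituted word costs almost 30% of the score here, because it breaks not just its own unigram but every bigram it participates in.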
You want to evaluate a machine translation model's output using BLEU score. Which model output is best suited for BLEU evaluation?
BLEU score compares n-grams, so think about the input format it requires.
BLEU score requires tokenized sentences to compare n-gram overlaps, so a list of tokenized sentences is needed.
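A toy example of preparing that input format; simple_tokenize is a stand-in helper of our own, where a real pipeline would use a proper tokenizer such as nltk.word_tokenize.

```python
def simple_tokenize(text):
    # Lowercase and strip sentence-final punctuation before splitting on spaces.
    return text.lower().rstrip(".!?").split()

reference_text = "The cat is on the mat."
candidate_text = "The cat is on the mat."

references = [simple_tokenize(reference_text)]  # a list of token lists
candidate = simple_tokenize(candidate_text)     # a single token list
print(candidate)  # ['the', 'cat', 'is', 'on', 'the', 'mat']
```

Note the asymmetry: the candidate is one token list, while references is a list of token lists, because BLEU supports multiple references per sentence.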
Which statement about BLEU scores is correct?
Think about the range and meaning of BLEU scores.
BLEU scores range from 0 to 1 (tools often report them scaled to 0-100), with higher scores indicating greater n-gram overlap with the references and, by proxy, better translation quality.
What goes wrong when the following code calculates a BLEU score?
from nltk.translate.bleu_score import sentence_bleu

reference = ['the', 'cat', 'sat']
candidate = ['the', 'cat', 'sat']
score = sentence_bleu(reference, candidate)
print(score)
Check the expected input format for references in sentence_bleu.
sentence_bleu expects the reference argument to be a list of reference sentences, each itself a list of tokens. A flat token list makes NLTK treat each token string as a separate reference and iterate over its characters, so no word n-grams overlap and the score comes out near 0 (with a warning) instead of the 1.0 an exact match should give.
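A corrected sketch of the same example, with the extra level of nesting the references need. The unigram-only weights are our addition: the sentence has only three tokens, so the default 4-gram weights would find no 4-grams at all.

```python
from nltk.translate.bleu_score import sentence_bleu

# Correct format: references is a list of tokenized reference sentences,
# so even a single reference needs one extra level of nesting.
reference = [['the', 'cat', 'sat']]
candidate = ['the', 'cat', 'sat']

# Restrict scoring to unigrams for this 3-token sentence; the default
# weights go up to 4-grams, which a 3-token candidate cannot contain.
score = sentence_bleu(reference, candidate, weights=(1.0,))
print(round(score, 2))  # 1.0
```

With the nesting fixed, the exact match scores 1.0 as expected.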