Challenge - 5 Problems
Text Evaluation Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
intermediate · 2:00 remaining
What is the BLEU score output of this code?
Given the reference and candidate sentences below, what is the BLEU score computed by the code?
NLP
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
score = sentence_bleu(reference, candidate)
print(round(score, 3))
Attempts:
2 left
💡 Hint
BLEU score compares n-gram overlap; small differences in words reduce the score.
✗ Incorrect
With NLTK's default settings, sentence_bleu averages 1- to 4-gram precisions. The one-word difference ('sat' vs 'is') leaves no matching 4-grams, so the 4-gram precision is 0, NLTK emits a warning, and the geometric mean collapses: the printed result is 0.0 despite the high unigram overlap. A smoothing function is needed to get a non-zero score on short sentences like this.
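To see why the score collapses, here is a minimal pure-Python sketch of BLEU's clipped n-gram precision (the helper names `ngrams` and `modified_precision` are for illustration, not NLTK's API):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision, the per-order ingredient of BLEU."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

for n in range(1, 5):
    print(n, modified_precision(reference, candidate, n))
# 1-gram: 5/6, 2-gram: 3/5, 3-gram: 1/4, 4-gram: 0
```

Because the geometric mean of the four precisions includes a zero term, the unsmoothed 4-gram BLEU is effectively zero.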
🧠 Conceptual
intermediate · 1:30 remaining
Which metric is best for evaluating summary length and content overlap?
You want to evaluate how well a generated summary matches a reference summary, focusing on content overlap and length. Which metric is most suitable?
Attempts:
2 left
💡 Hint
This metric is designed for comparing summaries and measures overlap of sequences.
✗ Incorrect
ROUGE is designed for summary evaluation: its variants measure overlap of n-grams (ROUGE-1, ROUGE-2) and the longest common subsequence (ROUGE-L) between generated and reference summaries, and its recall orientation makes it better suited to summaries than BLEU.
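A minimal sketch of ROUGE-1 (unigram overlap) computed from scratch, to make the recall/precision/F1 definitions concrete (the function name `rouge_1` is for illustration):

```python
from collections import Counter

def rouge_1(reference, candidate):
    """ROUGE-1: clipped unigram overlap as recall, precision, and F1."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    recall = overlap / len(reference)       # fraction of reference covered
    precision = overlap / len(candidate)    # fraction of candidate that matches
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

reference = 'the cat is on the mat'.split()
candidate = 'the cat sat on the mat'.split()
recall, precision, f1 = rouge_1(reference, candidate)
# 5 of 6 unigrams overlap, so recall = precision = F1 = 5/6
```

Recall answers "how much of the reference did the summary cover?", which is why ROUGE is the usual choice for summarization.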
❓ Metrics
advanced · 1:30 remaining
What ROUGE metric measures longest common subsequence overlap?
Among ROUGE-1, ROUGE-2, and ROUGE-L, which one measures the longest common subsequence between generated and reference texts?
Attempts:
2 left
💡 Hint
This metric captures sentence-level structure by longest matching sequence.
✗ Incorrect
ROUGE-L measures the longest common subsequence (LCS) between candidate and reference, capturing sentence-level structure beyond simple n-gram overlap.
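The LCS at the heart of ROUGE-L can be sketched with a standard dynamic-programming table (helper names are illustrative):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

reference = 'the cat is on the mat'.split()
candidate = 'the cat sat on the mat'.split()
# LCS is 'the cat on the mat' (length 5): in-order, not necessarily contiguous
print(lcs_length(reference, candidate))
```

Unlike n-gram overlap, the subsequence need not be contiguous, so ROUGE-L rewards preserved word order even across insertions and substitutions.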
🔧 Debug
advanced · 2:00 remaining
What is wrong with this BLEU score code?
What mistake does this code make, and what happens when it runs?
from nltk.translate.bleu_score import sentence_bleu
reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
score = sentence_bleu(reference, candidate)
print(score)
Attempts:
2 left
💡 Hint
Check the expected input type for references in sentence_bleu.
✗ Incorrect
sentence_bleu expects references as a list of tokenized reference sentences (a list of lists of tokens). Passing a flat token list makes NLTK treat each individual word as its own reference sentence, so n-grams are matched against single-word "references" and the score collapses to near zero with warnings, instead of reflecting the real overlap.
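A minimal sketch of the corrected input shape (the nltk call itself is left commented out so the snippet stands alone):

```python
# sentence_bleu takes a LIST of tokenized reference sentences, even when
# there is only one reference, plus a single tokenized candidate.
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]  # note the extra nesting
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# from nltk.translate.bleu_score import sentence_bleu   # requires nltk
# score = sentence_bleu(reference, candidate)
```

The outer list exists because BLEU supports multiple reference translations per candidate; a flat list silently changes what counts as a "reference".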
❓ Model Choice
expert · 1:30 remaining
Which evaluation metric is best for machine translation quality?
You want to evaluate machine translation output quality automatically. Which metric is most widely accepted and specifically designed for this task?
Attempts:
2 left
💡 Hint
This metric compares n-gram overlap between candidate and reference translations.
✗ Incorrect
BLEU is the standard automatic metric for machine translation evaluation: it measures modified n-gram precision and applies a brevity penalty to keep overly short translations from scoring well.
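The brevity penalty mentioned above has a simple closed form, sketched here (function name is illustrative):

```python
import math

def brevity_penalty(ref_len, cand_len):
    """BLEU brevity penalty: 1 when the candidate is at least as long as
    the reference, otherwise exp(1 - r/c), which shrinks toward 0 as the
    candidate gets shorter."""
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)

print(brevity_penalty(6, 6))    # equal lengths: no penalty
print(brevity_penalty(10, 5))   # half-length candidate: exp(-1) ≈ 0.368
```

Without this factor, a one-word candidate that happens to appear in the reference could achieve perfect n-gram precision.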