
Evaluating generated text (BLEU, ROUGE) in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
What is the BLEU score output of this code?
Given the reference and candidate sentences below, what is the BLEU score computed by the code?
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
score = sentence_bleu(reference, candidate, weights=(1/3, 1/3, 1/3))  # uniform 1- to 3-gram weights
print(round(score, 3))
A. 0.840
B. 0.759
C. 0.667
D. 0.500
💡 Hint
BLEU score compares n-gram overlap; small differences in words reduce the score.
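You can verify the n-gram arithmetic behind this problem by hand. Below is a minimal pure-Python sketch of modified (clipped) n-gram precision, no NLTK required:

```python
from collections import Counter

def ngram_precision(reference, candidate, n):
    """Clipped n-gram precision of candidate against a single reference."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / max(sum(cand.values()), 1)

reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
for n in range(1, 5):
    print(f"p{n} = {ngram_precision(reference, candidate, n):.3f}")
# p1 = 0.833, p2 = 0.600, p3 = 0.250, p4 = 0.000
```

BLEU is the geometric mean of these precisions under the chosen weights, multiplied by a brevity penalty (1 here, since candidate and reference have equal length). Note that p4 = 0: with the default 4-gram weights, an unsmoothed sentence-level BLEU collapses toward zero for short sentences like these.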
🧠 Conceptual (intermediate)
Which metric is best for evaluating summary length and content overlap?
You want to evaluate how well a generated summary matches a reference summary, focusing on content overlap and length. Which metric is most suitable?
A. ROUGE
B. BLEU
C. Accuracy
D. Mean Squared Error
💡 Hint
This metric is designed for comparing summaries and measures overlap of sequences.
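As a reference point, the core of ROUGE-1 recall can be sketched in a few lines of plain Python. This is a simplified illustration; real ROUGE implementations also report precision and F1 and can apply stemming:

```python
from collections import Counter

def rouge1_recall(reference_tokens, summary_tokens):
    """ROUGE-1 recall: fraction of reference unigrams covered by the summary (clipped)."""
    ref = Counter(reference_tokens)
    summ = Counter(summary_tokens)
    overlap = sum(min(ref[w], summ[w]) for w in ref)
    return overlap / sum(ref.values())

ref = 'the cat sat on the mat'.split()
summary = 'the cat on the mat'.split()
print(round(rouge1_recall(ref, summary), 3))  # 5 of 6 reference unigrams covered -> 0.833
```

Because it is recall-oriented over the reference, ROUGE rewards summaries that cover the reference content, which is why it is the standard choice for summarization.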
Metrics (advanced)
What ROUGE metric measures longest common subsequence overlap?
Among the ROUGE variants below, which one measures the longest common subsequence between generated and reference texts?
A. ROUGE-L
B. ROUGE-1
C. ROUGE-W
D. ROUGE-2
💡 Hint
This metric captures sentence-level structure via the longest matching subsequence.
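ROUGE-L is built on the longest common subsequence (LCS). A minimal sketch of the underlying computation (simplified; it omits ROUGE-L's weighted F-measure):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

ref = 'the cat is on the mat'.split()
cand = 'the cat sat on the mat'.split()
lcs = lcs_len(ref, cand)          # 'the cat on the mat' -> length 5
recall = lcs / len(ref)           # LCS-based recall
precision = lcs / len(cand)       # LCS-based precision
print(f"LCS={lcs}, recall={recall:.3f}, precision={precision:.3f}")
```

Unlike ROUGE-2, the subsequence need not be contiguous, so ROUGE-L credits in-order word matches even when other words intervene.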
🔧 Debug (advanced)
What's wrong with this BLEU score code?
What happens when this code runs, and why?
from nltk.translate.bleu_score import sentence_bleu
reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
score = sentence_bleu(reference, candidate)
print(score)
A. ValueError: empty reference list
B. SyntaxError: invalid syntax
C. No error; it prints a BLEU score
D. TypeError: expected list of references, got list
💡 Hint
Check the expected input type for references in sentence_bleu: it takes a list of tokenized reference sentences, so a flat token list is silently treated as one single-word reference per token.
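For reference, the fix is to wrap the tokenized reference in an outer list, since sentence_bleu accepts multiple references per candidate. A sketch (requires nltk; restricting weights to 1- and 2-grams is an illustrative choice to avoid the zero 4-gram count for this short pair):

```python
from nltk.translate.bleu_score import sentence_bleu

# sentence_bleu expects a LIST of tokenized references, hence the outer list
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
# score only 1- and 2-grams: p1 = 5/6, p2 = 3/5, brevity penalty = 1
score = sentence_bleu(reference, candidate, weights=(0.5, 0.5))
print(round(score, 3))  # geometric mean sqrt(5/6 * 3/5) -> 0.707
```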
Model Choice (expert)
Which evaluation metric is best for machine translation quality?
You want to evaluate machine translation output quality automatically. Which metric is most widely accepted and specifically designed for this task?
A. F1-score
B. ROUGE
C. BLEU
D. Perplexity
💡 Hint
This metric compares n-gram overlap between candidate and reference translations.
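In practice, MT systems are scored with corpus-level BLEU rather than averaged per-sentence scores, since aggregating n-gram counts over the whole test set is more stable. A minimal sketch with NLTK (requires nltk; the example sentences and the bigram-only weights are illustrative):

```python
from nltk.translate.bleu_score import corpus_bleu

# One list of references per candidate sentence.
references = [
    [['the', 'cat', 'is', 'on', 'the', 'mat']],
    [['there', 'is', 'a', 'dog', 'in', 'the', 'garden']],
]
candidates = [
    ['the', 'cat', 'is', 'on', 'the', 'mat'],
    ['a', 'dog', 'is', 'in', 'the', 'garden'],
]
# n-gram counts are pooled across sentences before the geometric mean,
# and a corpus-level brevity penalty penalizes short output (12 vs 13 tokens here)
score = corpus_bleu(references, candidates, weights=(0.5, 0.5))
print(round(score, 3))  # -> 0.823
```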