Evaluating generated text (BLEU, ROUGE) in NLP - Cheat Sheet & Quick Revision

Recall & Review
beginner
What does BLEU score measure in text generation?
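BLEU (Bilingual Evaluation Understudy) measures how closely generated text matches one or more reference texts. It computes clipped n-gram precision (typically for n = 1 to 4), combines the results with a geometric mean, and applies a brevity penalty to outputs that are shorter than the reference.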
beginner
What is ROUGE used for in evaluating generated text?
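ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated text and reference summaries, with an emphasis on recall. Common variants are ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence), and ROUGE-S (skip-bigrams, i.e. word pairs that may have gaps between them).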
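As an illustration of the word-sequence idea, here is a minimal sketch of ROUGE-L recall, the variant based on the longest common subsequence (LCS). It assumes whitespace tokenization and a single reference; the function names `lcs_length` and `rouge_l_recall` are made up for this example and do not come from any library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """Share of reference tokens covered by the LCS (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0


# LCS is "the cat on the mat" (5 tokens) out of 6 reference tokens.
print(rouge_l_recall("the cat was on the mat", "the cat sat on the mat"))
```

Real ROUGE implementations usually add stemming and report precision and F-measure alongside recall, so treat this only as a picture of the core idea.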
intermediate
Why is BLEU considered precision-oriented while ROUGE is recall-oriented?
BLEU focuses on how much of the generated text matches the reference (precision), while ROUGE focuses on how much of the reference text is covered by the generated text (recall).
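The difference comes down to the denominator. The sketch below counts overlapping unigrams once and divides by the generated length for precision and by the reference length for recall. It assumes whitespace tokenization, and `overlap_precision_recall` is a hypothetical helper written for this card, not part of any BLEU or ROUGE package.

```python
from collections import Counter


def overlap_precision_recall(candidate, reference):
    """Unigram overlap expressed as (precision, recall)."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


# A very short output: everything it says is in the reference (precision 1.0),
# but it covers only 2 of the 6 reference tokens (recall about 0.33).
print(overlap_precision_recall("the cat", "the cat sat on the mat"))
```

The example shows why a precision-only score would reward overly short outputs, and why a recall-only score would reward overly long ones.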
beginner
What is an n-gram in the context of BLEU and ROUGE?
An n-gram is a sequence of 'n' words in a row. For example, a 2-gram (bigram) is two consecutive words. Both BLEU and ROUGE compare these sequences between generated and reference texts.
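A small sketch of n-gram extraction, assuming the text has already been split into word tokens (here with a plain whitespace split). The helper name `ngrams` is chosen for this example only.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

Because each bigram keeps two words in their original order, matching on bigrams or longer n-grams checks local word order, which single-word matching cannot do.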
intermediate
How does BLEU handle multiple reference texts?
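When several references are available, BLEU credits an n-gram in the generated text if it appears in any of them, clipping each n-gram's count to the maximum number of times it occurs in a single reference. This avoids penalizing valid alternative wordings while still preventing a repeated word from being counted more often than any reference uses it.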
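To make the multi-reference matching concrete, here is a minimal sketch of clipped unigram precision, the building block BLEU applies at each n-gram order. It assumes whitespace tokenization and covers unigrams only (no brevity penalty, no geometric mean); `clipped_unigram_precision` is an illustrative name, not a library function.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Unigram precision with counts clipped by the best single reference."""
    cand_counts = Counter(candidate.split())
    # For each token, remember the highest count seen in any one reference.
    max_ref = Counter()
    for ref in references:
        for tok, count in Counter(ref.split()).items():
            max_ref[tok] = max(max_ref[tok], count)
    clipped = sum(min(count, max_ref[tok]) for tok, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


refs = ["the cat is on the mat", "there is a cat on the mat"]
# "the" appears 7 times in the candidate but at most twice in one reference,
# so only 2 of the 7 tokens are credited.
print(clipped_unigram_precision("the the the the the the the", refs))
```

Without clipping, this degenerate candidate would score a perfect 7/7, which is why the maximum is taken per reference rather than summed across references.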
What does a high BLEU score indicate?
A. The generated text is very different from the reference
B. The generated text closely matches the reference text
C. The generated text is longer than the reference
D. The generated text has many spelling errors
Which metric is more focused on recall in text evaluation?
A. BLEU
B. F1 Score
C. Accuracy
D. ROUGE
What is an n-gram?
A. A sequence of n words in a row
B. A single word
C. A type of neural network
D. A punctuation mark
Which of these is true about BLEU?
A. It measures recall of generated text
B. It only works with one reference text
C. It measures precision of generated text
D. It ignores word order
ROUGE is commonly used to evaluate which type of generated text?
A. Text summarization
B. Speech recognition
C. Machine translation
D. Image captioning
Explain how BLEU and ROUGE differ in evaluating generated text.
Think about what each metric focuses on: matching generated text vs. covering reference text.
Describe what an n-gram is and why it is important for BLEU and ROUGE.
Consider how small word groups help check if texts are similar.