Practice

1. What does the ROUGE metric primarily measure in natural language processing?

easy

A. The sentiment of the generated text

B. The speed of text generation

C. The overlap between generated text and reference text

D. The grammatical correctness of text

Solution

Step 1: Understand ROUGE's purpose
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) compares generated text, typically a summary, with one or more reference texts.
Step 2: Identify what ROUGE measures
It measures how much the generated text overlaps with the reference text, counted in n-grams (ROUGE-N) or longest common subsequences (ROUGE-L).
Final Answer:
The overlap between generated text and reference text -> Option C
Quick Check:
ROUGE = overlap measure [OK]

Hint: ROUGE checks text similarity, not speed or grammar [OK]

Common Mistakes:

Confusing ROUGE with grammar checkers
Thinking ROUGE measures sentiment
Assuming ROUGE measures generation speed

2. Which of the following is the correct way to calculate ROUGE-1 recall?

easy

A. Number of overlapping unigrams divided by total unigrams in generated text

B. Number of overlapping unigrams divided by total unigrams in reference text

C. Number of overlapping bigrams divided by total bigrams in generated text

D. Number of overlapping bigrams divided by total bigrams in reference text

Solution

Step 1: Recall definition in ROUGE-1
Recall measures how many of the reference text's unigrams also appear in the generated text.
Step 2: Apply recall formula
Recall = overlapping unigrams / total unigrams in reference text.
Final Answer:
Number of overlapping unigrams divided by total unigrams in reference text -> Option B
Quick Check:
Recall = overlap/reference [OK]

Hint: Recall divides by reference text count, not generated [OK]

Common Mistakes:

Mixing up recall with precision
Using generated text count in recall
Confusing unigrams with bigrams
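
The recall formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization and lowercasing, and the function name `rouge1_recall` is made up for this example. Matches are clipped, so a word counts at most as many times as it appears in both texts.

```python
from collections import Counter

def rouge1_recall(reference: str, generated: str) -> float:
    """Clipped unigram overlap divided by the number of reference unigrams."""
    # Assumption: simple whitespace tokenization and lowercasing.
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    # Counter intersection keeps the minimum count per word (clipping).
    overlap = sum((ref_counts & gen_counts).values())
    total_ref = sum(ref_counts.values())
    return overlap / total_ref if total_ref else 0.0
```

Note that the denominator is the reference length; swapping in the generated length would turn this into precision.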

3. Given the reference text: "the cat sat on the mat" and generated text: "the cat lay on rug", what is the ROUGE-1 precision score?

medium

A. 0.6

B. 0.5

C. 0.4

D. 0.7

Solution

Step 1: Identify overlapping unigrams
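The generated words "the", "cat", and "on" also appear in the reference; "lay" and "rug" do not. "the" occurs only once in the generated text, so it counts once. Overlapping unigrams = 3.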
Step 2: Calculate precision
Precision = overlapping unigrams / total unigrams in generated text = 3 / 5 = 0.6.
Final Answer:
0.6 -> Option A
Quick Check:
Precision = 3/5 = 0.6 [OK]

Hint: Precision = overlap / generated text words count [OK]

Common Mistakes:

Counting the reference's second "the" as an extra match
Using reference text length for precision
Ignoring how often a repeated word appears in each text
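
The calculation in this question can be checked with a short sketch. It assumes whitespace tokenization and clipped counts, and `rouge1_precision` is a hypothetical helper name rather than a library function.

```python
from collections import Counter

def rouge1_precision(reference: str, generated: str) -> float:
    """Clipped unigram overlap divided by the number of generated unigrams."""
    ref_counts = Counter(reference.split())
    gen_counts = Counter(generated.split())
    # A word matches at most as often as it occurs in both texts.
    overlap = sum((ref_counts & gen_counts).values())
    total_gen = sum(gen_counts.values())
    return overlap / total_gen if total_gen else 0.0

score = rouge1_precision("the cat sat on the mat", "the cat lay on rug")
print(score)  # 3 overlapping unigrams / 5 generated unigrams
```

Clipping matters for repeated words: with reference "the cat" and generated "the the the", only one "the" counts as a match, so precision is 1/3 rather than 1.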

4. You wrote code to compute ROUGE-L but the scores are always zero. Which of these is the most likely bug?

medium

A. Calculating precision instead of recall

B. Using ROUGE-1 instead of ROUGE-L

C. Using lowercase text for both inputs

D. Not tokenizing the texts before comparison

Solution

Step 1: Understand ROUGE-L calculation
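ROUGE-L is based on the longest common subsequence (LCS) of tokens, so both texts must be split into tokens first.
Step 2: Identify impact of missing tokenization
If the texts are not tokenized, for example when each whole string is treated as a single token, nothing matches unless the texts are identical, so the LCS length and the scores are zero.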
Final Answer:
Not tokenizing the texts before comparison -> Option D
Quick Check:
Tokenization missing = zero ROUGE-L [OK]

Hint: Always tokenize texts before ROUGE-L calculation [OK]

Common Mistakes:

Skipping tokenization step
Confusing ROUGE types
Ignoring case normalization impact
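
One way this bug can show up is sketched below, under stated assumptions: the LCS is computed over lists of tokens, only the recall side of ROUGE-L is shown, and the names `lcs_length` and `rouge_l_recall` are illustrative. The buggy call wraps each whole string as a single token, which is just one plausible form of the mistake.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(ref_tokens, gen_tokens):
    """LCS length divided by the number of reference tokens."""
    if not ref_tokens:
        return 0.0
    return lcs_length(ref_tokens, gen_tokens) / len(ref_tokens)

reference = "the cat sat on the mat"
generated = "the cat lay on rug"

# Bug: each text is passed as one big "token", so nothing matches.
buggy = rouge_l_recall([reference], [generated])
# Fix: split into word tokens before comparing.
fixed = rouge_l_recall(reference.split(), generated.split())
```

Here `buggy` is zero because the two whole strings differ, while `fixed` reflects the common subsequence "the cat on".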

5. You want to evaluate a summarization model using ROUGE scores. The model produces very short summaries missing many reference words. Which ROUGE metric and score should you focus on to best understand coverage?

hard

A. ROUGE-1 recall, because it shows how many reference words are captured

B. ROUGE-1 precision, because it shows how many generated words are correct

C. ROUGE-L F1, because it balances precision and recall on longest sequences

D. ROUGE-2 precision, because it focuses on bigram accuracy

Solution

Step 1: Understand the problem context
The summaries are short and miss many reference words, so coverage of the reference is low.
Step 2: Choose metric that measures coverage
Recall measures how much of the reference text is captured by the summary, so ROUGE-1 recall is best.
Final Answer:
ROUGE-1 recall, because it shows how many reference words are captured -> Option A
Quick Check:
Coverage = recall = ROUGE-1 recall [OK]

Hint: Use ROUGE-1 recall to check coverage of reference words [OK]

Common Mistakes:

Focusing on precision instead of recall
Using ROUGE-2, which is stricter because it requires matching bigrams
Ignoring recall's role in coverage
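
A small sketch shows why precision can hide the problem here. The reference and summary strings are invented for illustration, tokenization is plain whitespace splitting, and `rouge1_scores` is a hypothetical helper.

```python
from collections import Counter

def rouge1_scores(reference: str, generated: str):
    """Return (precision, recall) from clipped unigram overlap."""
    ref = Counter(reference.split())
    gen = Counter(generated.split())
    overlap = sum((ref & gen).values())
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall

# Hypothetical example: a very short summary of a longer reference.
reference = "the cat sat on the mat near the door"
short_summary = "the cat sat"

precision, recall = rouge1_scores(reference, short_summary)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Every word in the short summary appears in the reference, so precision is perfect, yet only 3 of the 9 reference words are covered. Recall is the number that exposes the missing content.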

ROUGE evaluation metrics in NLP - ML Experiment: Train & Evaluate
