
ROUGE evaluation metrics in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Understanding ROUGE-N metric

What does the ROUGE-N metric primarily measure in text summarization evaluation?

A. The semantic similarity using word embeddings
B. The grammatical correctness of the generated summary
C. The length difference between generated and reference summaries
D. The overlap of n-grams between the generated summary and reference summary
💡 Hint

Think about what 'n-gram' means and what ROUGE-N counts.

Predict Output
intermediate
ROUGE-1 score calculation output

Given the following Python code snippet calculating ROUGE-1 recall, what is the printed output?

from collections import Counter

def rouge_1_recall(candidate, reference):
    candidate_tokens = candidate.split()
    reference_tokens = reference.split()
    ref_counts = Counter(reference_tokens)
    cand_counts = Counter(candidate_tokens)
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    recall = overlap / len(reference_tokens)
    return recall

candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(round(rouge_1_recall(candidate, reference), 2))

A. 0.83
B. 0.67
C. 0.71
D. 0.57
💡 Hint

Count overlapping words and divide by total reference words.

Model Choice
advanced
Choosing ROUGE variant for phrase-level matching

You want to evaluate summaries focusing on matching longer phrases rather than single words. Which ROUGE variant is best suited?

A. ROUGE-1
B. ROUGE-S
C. ROUGE-2
D. ROUGE-L
💡 Hint

Consider which metric uses 2-grams (pairs of words).

Metrics
advanced
Interpreting ROUGE-L score meaning

What does a high ROUGE-L score indicate about the generated summary compared to the reference?

A. The generated summary shares a long common subsequence with the reference, preserving word order
B. The generated summary has many matching individual words but in different order
C. The generated summary is much shorter than the reference
D. The generated summary uses synonyms of the reference words
💡 Hint

ROUGE-L uses longest common subsequence (LCS) to evaluate.

🔧 Debug
expert
Identifying error in ROUGE-2 precision calculation code

What error, if any, does the following code raise when calculating ROUGE-2 precision?

from collections import Counter

def rouge_2_precision(candidate, reference):
    def bigrams(text):
        return [text[i:i+2] for i in range(len(text)-1)]
    candidate_bigrams = bigrams(candidate.split())
    reference_bigrams = bigrams(reference.split())
    cand_counts = Counter(candidate_bigrams)
    ref_counts = Counter(reference_bigrams)
    overlap = sum(min(cand_counts[bg], ref_counts[bg]) for bg in cand_counts)
    precision = overlap / len(candidate_bigrams)
    return precision

candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(round(rouge_2_precision(candidate, reference), 2))

A. ZeroDivisionError
B. No error, outputs 0.60
C. TypeError
D. IndexError
💡 Hint

Check what type each bigram is: slicing a list returns a list, and Counter needs hashable keys.

Practice

1. What does the ROUGE metric primarily measure in natural language processing?
easy
A. The sentiment of the generated text
B. The speed of text generation
C. The overlap between generated text and reference text
D. The grammatical correctness of text

Solution

  1. Step 1: Understand ROUGE's purpose

    ROUGE is designed to compare generated text with a reference to check similarity.
  2. Step 2: Identify what ROUGE measures

    It measures how much the generated text overlaps with the reference text in terms of words or sequences.
  3. Final Answer:

    The overlap between generated text and reference text -> Option C
  4. Quick Check:

    ROUGE = overlap measure [OK]
Hint: ROUGE checks text similarity, not speed or grammar [OK]
Common Mistakes:
  • Confusing ROUGE with grammar checkers
  • Thinking ROUGE measures sentiment
  • Assuming ROUGE measures generation speed
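
The overlap idea in the solution above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the helper names `ngrams` and `ngram_overlap` are hypothetical, whitespace splitting stands in for a real tokenizer, and repeated n-grams are clipped to the smaller of the two counts.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as hashable tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(generated, reference, n=1):
    """Clipped count of n-grams shared by both texts (whitespace tokenization assumed)."""
    gen_counts = Counter(ngrams(generated.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Counter & Counter keeps the minimum count of every shared key.
    return sum((gen_counts & ref_counts).values())

generated = "the cat sat on the mat"
reference = "the cat is on the mat"
print(ngram_overlap(generated, reference, n=1))  # 5 shared unigrams
print(ngram_overlap(generated, reference, n=2))  # 3 shared bigrams
```

Dividing this overlap by the number of n-grams in the reference gives recall; dividing by the number in the generated text gives precision.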
2. Which of the following is the correct way to calculate ROUGE-1 recall?
easy
A. Number of overlapping unigrams divided by total unigrams in generated text
B. Number of overlapping unigrams divided by total unigrams in reference text
C. Number of overlapping bigrams divided by total bigrams in generated text
D. Number of overlapping bigrams divided by total bigrams in reference text

Solution

  1. Step 1: Recall definition in ROUGE-1

    Recall measures how much of the reference text's unigrams appear in the generated text.
  2. Step 2: Apply recall formula

    Recall = overlapping unigrams / total unigrams in reference text.
  3. Final Answer:

    Number of overlapping unigrams divided by total unigrams in reference text -> Option B
  4. Quick Check:

    Recall = overlap/reference [OK]
Hint: Recall divides by reference text count, not generated [OK]
Common Mistakes:
  • Mixing up recall with precision
  • Using generated text count in recall
  • Confusing unigrams with bigrams
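
As a rough sketch of the recall formula above, the function below divides the clipped unigram overlap by the reference length. Whitespace tokenization and the example sentences are assumptions chosen for illustration.

```python
from collections import Counter

def rouge_1_recall(generated, reference):
    """Clipped unigram overlap divided by the number of reference tokens."""
    gen_counts = Counter(generated.split())
    ref_tokens = reference.split()
    ref_counts = Counter(ref_tokens)
    # Each reference word counts at most as often as it appears in the generated text.
    overlap = sum(min(count, gen_counts[word]) for word, count in ref_counts.items())
    return overlap / len(ref_tokens) if ref_tokens else 0.0

# 3 matching words ("the", "cat", "on") out of 6 reference tokens.
print(rouge_1_recall("the cat lay on rug", "the cat sat on the mat"))  # 0.5
```

The denominator is the only thing that separates recall from precision, which is why mixing up the two text lengths is the most common mistake.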
3. Given the reference text: "the cat sat on the mat" and generated text: "the cat lay on rug", what is the ROUGE-1 precision score?
medium
A. 0.6
B. 0.5
C. 0.4
D. 0.7

Solution

  1. Step 1: Identify overlapping unigrams

    Common words: "the", "cat", "on". "the" appears twice in the reference but only once in the generated text, so it counts once. Overlapping unigrams = 3.
  2. Step 2: Calculate precision

    Precision = overlapping unigrams / total unigrams in generated text = 3 / 5 = 0.6.
  3. Final Answer:

    0.6 -> Option A
  4. Quick Check:

    Precision = 3/5 = 0.6 [OK]
Hint: Precision = overlap / generated text words count [OK]
Common Mistakes:
  • Counting duplicates incorrectly
  • Using reference text length for precision
  • Ignoring repeated words in calculation
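
The arithmetic in this solution can be checked with a short sketch. It assumes whitespace tokenization and clipped counts; the second call uses a made-up pair of texts to show how clipping handles repeated words.

```python
from collections import Counter

def rouge_1_precision(generated, reference):
    """Clipped unigram overlap divided by the number of generated tokens."""
    gen_tokens = generated.split()
    gen_counts = Counter(gen_tokens)
    ref_counts = Counter(reference.split())
    # Each generated word counts at most as often as it appears in the reference.
    overlap = sum(min(count, ref_counts[word]) for word, count in gen_counts.items())
    return overlap / len(gen_tokens) if gen_tokens else 0.0

print(rouge_1_precision("the cat lay on rug", "the cat sat on the mat"))  # 0.6
# Repeating a word does not inflate the score: only one "the" is credited.
print(round(rouge_1_precision("the the the", "the cat"), 2))  # 0.33
```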
4. You wrote code to compute ROUGE-L but the scores are always zero. Which of these is the most likely bug?
medium
A. Calculating precision instead of recall
B. Using ROUGE-1 instead of ROUGE-L
C. Using lowercase text for both inputs
D. Not tokenizing the texts before comparison

Solution

  1. Step 1: Understand ROUGE-L calculation

    ROUGE-L depends on longest common subsequence of tokens, so tokenization is essential.
  2. Step 2: Identify impact of missing tokenization

    If the texts are not tokenized, each whole string is compared as a single unit, so nothing matches unless the two texts are identical and the score is zero.
  3. Final Answer:

    Not tokenizing the texts before comparison -> Option D
  4. Quick Check:

    Tokenization missing = zero ROUGE-L [OK]
Hint: Always tokenize texts before ROUGE-L calculation [OK]
Common Mistakes:
  • Skipping tokenization step
  • Confusing ROUGE types
  • Ignoring case normalization impact
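
One way this bug can show up is sketched below, assuming a standard dynamic-programming LCS and whitespace tokenization. The helper name `lcs_length` and the example texts are illustrative choices, not taken from any particular library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

generated = "the cat lay on rug"
reference = "the cat sat on the mat"

# Bug: each text is passed as one "token", so nothing matches unless the texts are identical.
print(lcs_length([generated], [reference]))  # 0
# Fix: split into word tokens first; the LCS is "the cat on".
print(lcs_length(generated.split(), reference.split()))  # 3
```

With the tokenized inputs, ROUGE-L recall would be 3 / 6 and ROUGE-L precision 3 / 5.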
5. You want to evaluate a summarization model using ROUGE scores. The model produces very short summaries missing many reference words. Which ROUGE metric and score should you focus on to best understand coverage?
hard
A. ROUGE-1 recall, because it shows how many reference words are captured
B. ROUGE-1 precision, because it shows how many generated words are correct
C. ROUGE-L F1, because it balances precision and recall on longest sequences
D. ROUGE-2 precision, because it focuses on bigram accuracy

Solution

  1. Step 1: Understand the problem context

    The summaries are short and miss many reference words, so coverage of reference is low.
  2. Step 2: Choose metric that measures coverage

    Recall measures how much of the reference text is captured by the summary, so ROUGE-1 recall is best.
  3. Final Answer:

    ROUGE-1 recall, because it shows how many reference words are captured -> Option A
  4. Quick Check:

    Coverage = recall = ROUGE-1 recall [OK]
Hint: Use ROUGE-1 recall to check coverage of reference words [OK]
Common Mistakes:
  • Focusing on precision instead of recall
  • Using ROUGE-2 which is stricter
  • Ignoring recall's role in coverage
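
A small sketch makes the contrast concrete: a very short summary can score perfect precision while covering little of the reference. The function name `rouge_1_scores` and the example texts are assumptions for illustration, with whitespace tokenization and clipped counts.

```python
from collections import Counter

def rouge_1_scores(generated, reference):
    """Return (precision, recall) based on clipped unigram overlap."""
    gen_tokens, ref_tokens = generated.split(), reference.split()
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(gen_tokens) if gen_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    return precision, recall

# Every generated word is correct, but only 2 of 6 reference tokens are covered.
precision, recall = rouge_1_scores("the cat", "the cat sat on the mat")
print(precision, round(recall, 2))  # 1.0 0.33
```

Precision alone would rate this summary as perfect, so recall is the number that exposes the missing coverage.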