NLP · ~20 mins

ROUGE evaluation metrics in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual · intermediate
Understanding ROUGE-N metric

What does the ROUGE-N metric primarily measure in text summarization evaluation?

A. The semantic similarity using word embeddings
B. The grammatical correctness of the generated summary
C. The length difference between generated and reference summaries
D. The overlap of n-grams between the generated summary and reference summary
💡 Hint

Think about what 'n-gram' means and what ROUGE-N counts.
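As background, here is a minimal sketch of how clipped n-gram overlap is typically counted (the function names are illustrative, not from any particular library):

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of size n over the token list; tuples are hashable,
    # so they can serve as Counter keys.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(candidate, reference, n):
    # Clipped overlap: each n-gram counts at most as many times
    # as it occurs in the reference.
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    return sum(min(count, ref[g]) for g, count in cand.items())

print(ngram_overlap("the cat sat on the mat", "the cat is on the mat", 1))  # 5
print(ngram_overlap("the cat sat on the mat", "the cat is on the mat", 2))  # 3
```

Dividing this overlap by the reference n-gram count gives recall; dividing by the candidate n-gram count gives precision.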

Predict Output · intermediate
ROUGE-1 score calculation output

Given the following Python code snippet calculating ROUGE-1 recall, what is the printed output?

from collections import Counter

def rouge_1_recall(candidate, reference):
    candidate_tokens = candidate.split()
    reference_tokens = reference.split()
    ref_counts = Counter(reference_tokens)
    cand_counts = Counter(candidate_tokens)
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    recall = overlap / len(reference_tokens)
    return recall

candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(round(rouge_1_recall(candidate, reference), 2))
A. 0.83
B. 0.67
C. 0.71
D. 0.57
💡 Hint

Count overlapping words and divide by total reference words.

Model Choice · advanced
Choosing ROUGE variant for phrase-level matching

You want to evaluate summaries focusing on matching longer phrases rather than single words. Which ROUGE variant is best suited?

A. ROUGE-1
B. ROUGE-S
C. ROUGE-2
D. ROUGE-L
💡 Hint

Consider which metric uses 2-grams (pairs of words).

Metrics · advanced
Interpreting ROUGE-L score meaning

What does a high ROUGE-L score indicate about the generated summary compared to the reference?

A. The generated summary shares long common subsequences with the reference, preserving sentence structure
B. The generated summary has many matching individual words but in a different order
C. The generated summary is much shorter than the reference
D. The generated summary uses synonyms of the reference words
💡 Hint

ROUGE-L uses longest common subsequence (LCS) to evaluate.
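For reference, a rough sketch of ROUGE-L recall built on the standard dynamic-programming LCS (function names are illustrative):

```python
def lcs_length(a, b):
    # Classic O(len(a) * len(b)) dynamic program for the
    # longest common subsequence of two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_recall(candidate, reference):
    ref_tokens = reference.split()
    # Recall: LCS length divided by the number of reference tokens.
    return lcs_length(candidate.split(), ref_tokens) / len(ref_tokens)

print(round(rouge_l_recall("the cat sat on the mat",
                           "the cat is on the mat"), 2))  # 0.83
```

Because a subsequence need not be contiguous, ROUGE-L rewards candidates that keep the reference's word order without requiring exact adjacent matches.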

🔧 Debug · expert
Identifying error in ROUGE-2 precision calculation code

What error does the following code raise when calculating ROUGE-2 precision?

from collections import Counter

def rouge_2_precision(candidate, reference):
    def bigrams(text):
        return [text[i:i+2] for i in range(len(text)-1)]
    candidate_bigrams = bigrams(candidate.split())
    reference_bigrams = bigrams(reference.split())
    cand_counts = Counter(candidate_bigrams)
    ref_counts = Counter(reference_bigrams)
    overlap = sum(min(cand_counts[bg], ref_counts[bg]) for bg in cand_counts)
    precision = overlap / len(candidate_bigrams)
    return precision

candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(round(rouge_2_precision(candidate, reference), 2))
A. ZeroDivisionError
B. No error, outputs 0.60
C. TypeError
D. IndexError
💡 Hint

Look at what bigrams() produces when given a list of tokens: each slice is itself a list. Can a Counter use that as a key?
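Once you have spotted the issue, one possible repair is to convert each window to a tuple (so it is hashable) and to guard against an empty bigram list:

```python
from collections import Counter

def rouge_2_precision_fixed(candidate, reference):
    def bigrams(tokens):
        # Tuples, unlike list slices, are hashable and valid Counter keys.
        return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

    candidate_bigrams = bigrams(candidate.split())
    reference_bigrams = bigrams(reference.split())
    if not candidate_bigrams:
        # A candidate with fewer than 2 tokens has no bigrams;
        # return 0.0 instead of dividing by zero.
        return 0.0
    cand_counts = Counter(candidate_bigrams)
    ref_counts = Counter(reference_bigrams)
    overlap = sum(min(cand_counts[bg], ref_counts[bg]) for bg in cand_counts)
    return overlap / len(candidate_bigrams)

print(round(rouge_2_precision_fixed("the cat sat on the mat",
                                    "the cat is on the mat"), 2))  # 0.6
```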