What does the ROUGE-N metric primarily measure in text summarization evaluation?
Think about what 'n-gram' means and what ROUGE-N counts.
ROUGE-N measures the overlap of n-grams (contiguous sequences of n words) between the generated summary and the reference summary; by default it is recall-oriented, i.e. the fraction of the reference's n-grams that also appear in the generated summary.
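A minimal sketch of recall-oriented ROUGE-N, assuming simple whitespace tokenization and clipped n-gram counts (the names `ngrams` and `rouge_n_recall` are illustrative, not from any library):

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-word sequences, as tuples so they are hashable
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Clipped overlap: a candidate n-gram counts at most as many times
    # as it occurs in the reference
    overlap = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)
```

Setting n=1 gives ROUGE-1 (unigram recall), n=2 gives ROUGE-2 (bigram recall).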
Given the following Python code snippet calculating ROUGE-1 recall, what is the printed output?
from collections import Counter

def rouge_1_recall(candidate, reference):
    candidate_tokens = candidate.split()
    reference_tokens = reference.split()
    ref_counts = Counter(reference_tokens)
    cand_counts = Counter(candidate_tokens)
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    recall = overlap / len(reference_tokens)
    return recall

candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(round(rouge_1_recall(candidate, reference), 2))
Count overlapping words and divide by total reference words.
The overlapping tokens (with clipped counts) are 'the' (×2), 'cat', 'on', and 'mat', giving an overlap of 5. The reference has 6 tokens, so recall = 5/6 ≈ 0.83.
You want to evaluate summaries focusing on matching longer phrases rather than single words. Which ROUGE variant is best suited?
Consider which metric uses 2-grams (pairs of words).
ROUGE-2 measures overlap of 2-word sequences (bigrams), capturing phrase-level matches better than ROUGE-1 which uses single words.
What does a high ROUGE-L score indicate about the generated summary compared to the reference?
ROUGE-L uses longest common subsequence (LCS) to evaluate.
ROUGE-L measures the longest common subsequence, so a high score means the generated summary preserves word order and structure similar to the reference.
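The LCS computation behind ROUGE-L can be sketched with standard dynamic programming; this is a minimal illustration assuming whitespace tokenization and an F1-style combination (the function names are illustrative):

```python
def lcs_length(a, b):
    # Classic O(len(a) * len(b)) longest-common-subsequence table
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```

Unlike n-gram overlap, the LCS rewards in-order matches without requiring them to be contiguous, which is why ROUGE-L reflects sentence-level structure.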
What error does the following code raise when calculating ROUGE-2 precision?
from collections import Counter

def rouge_2_precision(candidate, reference):
    def bigrams(text):
        return [text[i:i+2] for i in range(len(text)-1)]
    candidate_bigrams = bigrams(candidate.split())
    reference_bigrams = bigrams(reference.split())
    cand_counts = Counter(candidate_bigrams)
    ref_counts = Counter(reference_bigrams)
    overlap = sum(min(cand_counts[bg], ref_counts[bg]) for bg in cand_counts)
    precision = overlap / len(candidate_bigrams)
    return precision

candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(round(rouge_2_precision(candidate, reference), 2))
Look at what bigrams() returns: slicing a list yields lists, and Counter keys must be hashable.
The bigrams() helper slices a list of tokens, so each bigram is itself a list. Lists are unhashable, so Counter(candidate_bigrams) raises TypeError: unhashable type: 'list'. Converting each slice to a tuple fixes it; the candidate would then have 5 bigrams with an overlap of 3, giving precision = 3/5 = 0.6.
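A sketch of the repaired function, where the only substantive change is wrapping each bigram slice in tuple() so Counter can hash it:

```python
from collections import Counter

def rouge_2_precision(candidate, reference):
    def bigrams(tokens):
        # tuple() makes each bigram hashable, so Counter can count it
        return [tuple(tokens[i:i+2]) for i in range(len(tokens) - 1)]
    candidate_bigrams = bigrams(candidate.split())
    cand_counts = Counter(candidate_bigrams)
    ref_counts = Counter(bigrams(reference.split()))
    overlap = sum(min(cand_counts[bg], ref_counts[bg]) for bg in cand_counts)
    return overlap / len(candidate_bigrams)
```

With the example sentences, the matching bigrams are ('the', 'cat'), ('on', 'the'), and ('the', 'mat'), so precision = 3/5 = 0.6.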