ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It measures how well a computer summary matches a human summary by counting overlapping words or phrases. The main ROUGE metrics are ROUGE-N (overlapping n-grams), ROUGE-L (longest common subsequence), and ROUGE-S (skip-bigrams). ROUGE focuses on recall because it checks how much of the human summary is captured by the machine summary. This helps us know if the important parts are included.
ROUGE evaluation metrics in NLP - Model Metrics & Evaluation
Unlike classification metrics, ROUGE is not built from a confusion matrix. Instead, it counts the units (such as words or n-grams) that the two summaries share.
Human summary: "The cat sat on the mat"
Machine summary: "The cat is on the mat"
ROUGE-1 (unigrams) overlap, ignoring case: "The", "cat", "on", "the", "mat" = 5
Total human unigrams: 6
Total machine unigrams: 6
ROUGE-1 Recall = Overlap / Human unigrams = 5 / 6 ≈ 0.83
ROUGE-1 Precision = Overlap / Machine unigrams = 5 / 6 ≈ 0.83
ROUGE-1 F1 = 2 * (Precision * Recall) / (Precision + Recall) ≈ 0.83
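The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `rouge1` is made up for this example, and it assumes lowercasing plus whitespace splitting as the tokenizer, with no stemming.

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """ROUGE-1 recall, precision, and F1 using clipped unigram counts."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word,
    # so a word is never matched more often than it appears in either text.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"overlap": overlap, "recall": recall, "precision": precision, "f1": f1}

scores = rouge1("The cat sat on the mat", "The cat is on the mat")
print(scores)
```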
ROUGE recall measures how much of the human summary is covered by the machine summary. High recall means the machine summary includes most of the content of the human summary.
ROUGE precision measures how much of the machine summary is relevant to the human summary. High precision means the machine summary is focused and not adding unrelated info.
Example: If a machine summary is very long and repeats many words, recall may be high but precision low. If it is very short, precision may be high but recall low.
For summarization, recall is often more important to ensure key info is not missed, but precision helps keep summaries concise.
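To see the length trade-off in numbers, here is a small sketch that scores one long and one short machine summary against the same human summary. The sentences and the helper name `recall_precision` are invented for illustration, and tokenization is assumed to be lowercase plus whitespace splitting.

```python
from collections import Counter

def recall_precision(reference: str, candidate: str) -> tuple:
    """Return (ROUGE-1 recall, ROUGE-1 precision) with clipped counts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    return overlap / sum(ref.values()), overlap / sum(cand.values())

reference = "the cat sat on the mat"
# Hypothetical machine summaries: one padded with extra words, one very short.
long_summary = "the cat sat on the mat and then the dog ran around the big garden"
short_summary = "the cat"

long_recall, long_precision = recall_precision(reference, long_summary)
short_recall, short_precision = recall_precision(reference, short_summary)

print(f"long : recall={long_recall:.2f} precision={long_precision:.2f}")
print(f"short: recall={short_recall:.2f} precision={short_precision:.2f}")
```

The long summary covers every reference word but most of its own words are extra, while the short summary contains only matching words but leaves most of the reference uncovered.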
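ROUGE scores range from 0 to 1, and higher values mean stronger overlap with the human summary. The thresholds below are rough rules of thumb; typical values vary by dataset and task.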
- ROUGE-1 F1 above 0.5 is decent for many tasks.
- ROUGE-L above 0.4 shows good sequence matching.
- Scores below 0.3 usually mean poor summary quality.
However, very high ROUGE (near 1.0) may mean the machine summary is copying the human summary exactly, which is not always desired.
- Overfitting: Models may memorize training summaries, inflating ROUGE scores but not generalizing.
- Ignoring meaning: ROUGE counts words but does not understand meaning, so paraphrased good summaries may score low.
- Length bias: Longer summaries tend to have higher recall but lower precision.
- Data leakage: Using test summaries in training can falsely boost ROUGE.
- Single reference: Using only one human summary limits ROUGE's reliability; multiple references improve it.
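The single-reference pitfall can be illustrated with a short sketch. It assumes one common convention, which is to score the machine summary against each human summary and keep the best score; other aggregation rules exist. The function names and example sentences are hypothetical.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1 with clipped counts and whitespace tokenization (assumed)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

def multi_ref_rouge1_f1(references: list, candidate: str) -> float:
    """Assumed convention: keep the best score over all references."""
    return max(rouge1_f1(ref, candidate) for ref in references)

references = ["the cat sat on the mat", "a cat was sitting on the mat"]
candidate = "a cat was on the mat"

single = rouge1_f1(references[0], candidate)
best = multi_ref_rouge1_f1(references, candidate)
print(f"first reference only: {single:.3f}, best of two: {best:.3f}")
```

With only the first reference, a reasonable paraphrase is penalized; adding a second reference that uses similar wording raises the score.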
Your summarization model has ROUGE-1 recall of 0.95 but precision of 0.3. Is it good for production? Why or why not?
Answer: This means the model includes almost all important words (high recall) but also adds many unrelated words (low precision). The summary may be too long or noisy. It is not ideal for production because users want concise, relevant summaries. You should improve precision while keeping recall high.
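As a quick check on how unbalanced these numbers are, the F1 score for this case can be computed directly. The helper name `f1_score` is just for this example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(precision=0.3, recall=0.95)
print(round(f1, 3))  # 0.456
```

The harmonic mean is pulled toward the weaker of the two values, so the low precision drags F1 well below the recall of 0.95.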
Practice
Solution
Step 1: Understand ROUGE's purpose
ROUGE is designed to compare generated text with a reference to check similarity.
Step 2: Identify what ROUGE measures
It measures how much the generated text overlaps with the reference text in terms of words or sequences.
Final Answer:
The overlap between generated text and reference text -> Option C
Quick Check:
ROUGE = overlap measure [OK]
- Confusing ROUGE with grammar checkers
- Thinking ROUGE measures sentiment
- Assuming ROUGE measures generation speed
Solution
Step 1: Recall definition in ROUGE-1
Recall measures how much of the reference text's unigrams appear in the generated text.
Step 2: Apply recall formula
Recall = overlapping unigrams / total unigrams in reference text.
Final Answer:
Number of overlapping unigrams divided by total unigrams in reference text -> Option B
Quick Check:
Recall = overlap/reference [OK]
- Mixing up recall with precision
- Using generated text count in recall
- Confusing unigrams with bigrams
Given the reference text: "the cat sat on the mat" and generated text: "the cat lay on rug", what is the ROUGE-1 precision score?
Solution
Step 1: Identify overlapping unigrams
The words shared by both texts are "the", "cat", and "on", so there are 3 overlapping unigrams.
Step 2: Calculate precision
Precision = overlapping unigrams / total unigrams in generated text = 3 / 5 = 0.6.
Final Answer:
0.6 -> Option A
Quick Check:
Precision = 3/5 = 0.6 [OK]
- Counting duplicates incorrectly
- Using reference text length for precision
- Ignoring repeated words in calculation
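The duplicate-counting mistake is easy to reproduce. The sketch below contrasts clipped counting (each word matched at most as often as it appears in both texts) with a naive count that walks over the reference and credits every repeated word. Variable names are illustrative, and tokens are split on whitespace.

```python
from collections import Counter

reference = "the cat sat on the mat".split()
generated = "the cat lay on rug".split()

# Clipped overlap: "the" appears twice in the reference but only once in the
# generated text, so it is counted once.
overlap = sum((Counter(reference) & Counter(generated)).values())
precision = overlap / len(generated)

# Naive (incorrect) overlap: counts "the" twice because it loops over the
# reference and only checks membership in the generated text.
naive_overlap = sum(1 for token in reference if token in generated)

print(overlap, precision, naive_overlap)
```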
Solution
Step 1: Understand ROUGE-L calculation
ROUGE-L depends on longest common subsequence of tokens, so tokenization is essential.
Step 2: Identify impact of missing tokenization
If texts are not tokenized, comparison fails, resulting in zero scores.
Final Answer:
Not tokenizing the texts before comparison -> Option D
Quick Check:
Tokenization missing = zero ROUGE-L [OK]
- Skipping tokenization step
- Confusing ROUGE types
- Ignoring case normalization impact
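To make the role of tokenization concrete, here is a minimal ROUGE-L sketch that tokenizes first and then runs a longest-common-subsequence (LCS) computation on the token lists. It is a simplification under stated assumptions: lowercase plus whitespace splitting as the tokenizer, a plain F1 instead of a weighted F-measure, and hypothetical function names.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference: str, candidate: str) -> dict:
    # Tokenize and normalize case before comparing (assumed tokenizer).
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    precision = lcs / len(cand_tokens) if cand_tokens else 0.0
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"lcs": lcs, "recall": recall, "precision": precision, "f1": f1}

result = rouge_l("The cat sat on the mat", "The cat is on the mat")
print(result)
```

Here the longest common subsequence is "the cat on the mat". Without the `.lower()` call, "The" and "the" would be treated as different tokens, which is the case-normalization pitfall listed above.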
Solution
Step 1: Understand the problem context
The summaries are short and miss many reference words, so coverage of reference is low.
Step 2: Choose metric that measures coverage
Recall measures how much of the reference text is captured by the summary, so ROUGE-1 recall is best.
Final Answer:
ROUGE-1 recall, because it shows how many reference words are captured -> Option A
Quick Check:
Coverage = recall = ROUGE-1 recall [OK]
- Focusing on precision instead of recall
- Using ROUGE-2 which is stricter
- Ignoring recall's role in coverage
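To show why ROUGE-2 is stricter than ROUGE-1, the sketch below computes ROUGE-N recall for n = 1 and n = 2 on the same pair of sentences. The helper names are made up for this example, and tokenization is assumed to be lowercase plus whitespace splitting.

```python
from collections import Counter

def ngrams(tokens: list, n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference: str, candidate: str, n: int) -> float:
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap divided by the number of n-grams in the reference.
    return sum((ref & cand).values()) / total

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

r1 = rouge_n_recall(reference, candidate, 1)
r2 = rouge_n_recall(reference, candidate, 2)
print(f"ROUGE-1 recall={r1:.2f}, ROUGE-2 recall={r2:.2f}")
```

A single changed word breaks two bigrams ("cat sat" and "sat on"), so the bigram recall drops further than the unigram recall.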
