
Extractive summarization in NLP - Model Metrics & Evaluation

Which metric matters for extractive summarization and WHY

For extractive summarization, ROUGE scores are the most important metrics. ROUGE measures the overlap of words or phrases between the model's summary and a human-written reference summary, which tells us how well the model's output captures the content a human judged important.

Specifically, ROUGE-1 measures overlap of single words (unigrams), ROUGE-2 measures overlap of two-word sequences (bigrams), and ROUGE-L measures the longest common subsequence between the two summaries. Together they show whether the summary captures key content and preserves local word order.
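The n-gram variants above can be sketched in a few lines of pure Python. This is a minimal illustration, not the official ROUGE implementation (which adds stemming, stopword options, and sentence splitting); it assumes naive lowercase whitespace tokenization:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(reference, candidate, n=1):
    """ROUGE-N recall, precision, and F1 from clipped n-gram counts."""
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    # Clipped overlap: an n-gram counts at most as often as it occurs in the reference.
    overlap = sum(min(cand_counts[g], ref_counts[g]) for g in cand_counts)
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1
```

For example, `rouge_n("the cat sat on the mat", "cat sat on mat", n=1)` gives recall 4/6 because four of the six reference tokens appear in the model summary.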

Plain classification accuracy is less useful here because summarization is judged on content coverage and relevance, not on per-example classification correctness.

Confusion matrix or equivalent visualization

Extractive summarization does not use a confusion matrix like classification. Instead, we visualize overlap with ROUGE scores.

Reference summary: "The cat sat on the mat."  (6 tokens)
Model summary:     "Cat sat on mat."          (4 tokens)

ROUGE-1 (unigram overlap): 4 of 6 reference words matched → recall ≈ 0.67 (precision 4/4 = 1.0)
ROUGE-2 (bigram overlap): 2 of 5 reference bigrams matched ("cat sat", "sat on") → recall = 0.40
ROUGE-L (longest common subsequence): LCS = "cat sat on mat", length 4 → recall 4/6 ≈ 0.67
    

This shows how much the model summary matches the reference in simple terms.
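ROUGE-L differs from the n-gram variants because it rewards in-order matches of any length. A minimal sketch of the longest-common-subsequence computation (again assuming naive whitespace tokenization, not the official scorer):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Classic dynamic-programming table, O(len(a) * len(b)).
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length divided by reference length."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return lcs_length(ref, cand) / len(ref)
```

On the toy pair above, the LCS is "cat sat on mat" (length 4), so recall is 4/6 ≈ 0.67.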

Precision vs Recall tradeoff with examples

In extractive summarization, precision means how many words in the model summary are actually important (found in the reference). Recall means how many important words from the reference summary the model captured.

High precision but low recall means the summary is very accurate but misses many key points (too short or incomplete). High recall but low precision means the summary covers many key points but also includes irrelevant info (too long or noisy).

Example:

  • High precision, low recall: Model summary "Cat sat." — precision 2/2 = 1.0, recall 2/6 ≈ 0.33 (only important words, but misses "on the mat")
  • High recall, low precision: Model summary "The cat sat on the mat and played." — recall 6/6 = 1.0, precision 6/8 = 0.75 (covers all key words but adds "and played")

Good summaries balance precision and recall for clear, concise, and complete content.

What "good" vs "bad" metric values look like for extractive summarization

As a rough rule of thumb (thresholds vary by dataset and summary length), ROUGE-1 and ROUGE-L scores above about 0.5 indicate strong overlap with human summaries.

Scores below about 0.3 usually indicate poor content coverage or irrelevant summaries.

Example:

  • Good: ROUGE-1 = 0.65, ROUGE-2 = 0.55, ROUGE-L = 0.60
  • Bad: ROUGE-1 = 0.25, ROUGE-2 = 0.10, ROUGE-L = 0.20

Higher scores mean the summary better matches important content from the original text.

Common pitfalls in metrics for extractive summarization
  • Overfitting: Model memorizes training summaries, scoring high on ROUGE but failing on new texts.
  • Data leakage: Letting test summaries influence training artificially inflates ROUGE scores.
  • Ignoring summary length: Very short summaries can get high precision but miss content (low recall).
  • ROUGE limitations: ROUGE measures overlap but not readability or coherence.
  • Accuracy paradox: High accuracy on trivial summaries doesn't mean good summarization.
Self-check question

Your extractive summarization model has a ROUGE-1 score of 0.85 but a ROUGE-2 score of 0.30. Is this good? Why or why not?

Answer: This means the model captures many important single words (high ROUGE-1) but struggles with word pairs or phrase structure (low ROUGE-2). It may produce summaries with correct words but poor flow or missing key phrases. So, it is partially good but needs improvement in capturing meaningful word sequences.
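The ROUGE-1/ROUGE-2 gap in the self-check can be demonstrated with a scrambled sentence: every word matches, yet no bigram survives the reordering. A small illustrative sketch (the sentence pair is made up for this demo):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_ratio(reference, candidate, n):
    """Fraction of reference n-grams that also appear in the candidate."""
    ref = ngrams(reference.lower().split(), n)
    cand = set(ngrams(candidate.lower().split(), n))
    return sum(1 for g in ref if g in cand) / len(ref)

ref = "the quick brown fox jumps"
scrambled = "fox brown quick the jumps"
# Unigram overlap is perfect (1.0): every word is present.
# Bigram overlap is zero: word order was destroyed.
```

This is exactly the high-ROUGE-1 / low-ROUGE-2 pattern: right words, broken phrasing.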

Key Result
ROUGE scores (especially ROUGE-1 and ROUGE-2) best measure extractive summarization quality by showing word and phrase overlap with human summaries.