
ROUGE evaluation metrics in NLP

Introduction
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) checks how good a machine-generated summary is by comparing it to a human-written reference, measuring how much the two overlap in words or phrases. Typical situations where it helps:
When you want to see how well a machine-made summary matches a human summary.
When testing different text summarization methods to find the best one.
When evaluating chatbots or other text-generating AI for output quality.
When comparing translations or paraphrases to the original text.
When measuring improvements after changing your text generation model.
Syntax
Python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference_text, generated_text)
You create a scorer object, specifying which ROUGE variants to compute, such as 'rouge1' or 'rougeL'.
The score method takes the reference text first and the generated text second, and returns a dict mapping each variant to its precision, recall, and F1 (fmeasure) scores.
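Under the hood, ROUGE-1 precision, recall, and F1 are simple ratios of overlapping single words. A minimal stdlib-only sketch of that arithmetic (an illustration, not the rouge_score implementation, which also strips punctuation and can apply stemming):

```python
from collections import Counter

def rouge1(reference: str, candidate: str):
    """Naive ROUGE-1 over lowercased, whitespace-split unigrams."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped count of shared unigrams
    precision = overlap / max(sum(cand.values()), 1)  # share of candidate words matched
    recall = overlap / max(sum(ref.values()), 1)      # share of reference words matched
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge1("the cat sat on the mat", "the cat is on the mat")
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # 5 of 6 words overlap in each direction
```

Here the texts share five of six words ('the' twice, 'cat', 'on', 'mat'), so precision, recall, and F1 all come out to 5/6.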
Examples
This compares two simple sentences using ROUGE-1, which looks at overlapping single words.
Python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
scores = scorer.score('The cat sat on the mat.', 'The cat is on the mat.')
This uses ROUGE-L, which measures longest common subsequence, without stemming words.
Python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=False)
scores = scorer.score('A quick brown fox.', 'A quick fox.')
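ROUGE-L's core is the longest common subsequence (LCS) of the two token sequences: precision divides the LCS length by the candidate length, recall by the reference length. A stdlib-only sketch of that computation (again an illustration, not the library's code):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

ref = "a quick brown fox".split()
gen = "a quick fox".split()

lcs = lcs_len(ref, gen)      # shared in-order words: 'a quick fox' -> 3
precision = lcs / len(gen)   # 3/3 = 1.00
recall = lcs / len(ref)      # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)
print(f"LCS={lcs} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Every word of the generated text appears in the reference in order, so precision is perfect while recall is penalized for the missing 'brown'.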
Sample Model
This code compares a reference sentence and a generated sentence using ROUGE-1 and ROUGE-L with stemming. It prints precision, recall, and F1 scores for each metric.
Python
from rouge_score import rouge_scorer

reference = "The quick brown fox jumps over the lazy dog."
generated = "A fast brown fox leaps over a lazy dog."

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, generated)

print(f"ROUGE-1: Precision={scores['rouge1'].precision:.2f}, Recall={scores['rouge1'].recall:.2f}, F1={scores['rouge1'].fmeasure:.2f}")
print(f"ROUGE-L: Precision={scores['rougeL'].precision:.2f}, Recall={scores['rougeL'].recall:.2f}, F1={scores['rougeL'].fmeasure:.2f}")
Important Notes
ROUGE-1 counts overlapping single words (unigrams) between the two texts.
ROUGE-L measures the longest sequence of words the texts share in the same order (longest common subsequence).
Stemming matches words that share a root, like 'jumps' and 'jump'.
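To see why stemming matters, compare unigram overlap with and without reducing words to a root form. The suffix rule below is a deliberately naive stand-in used purely for illustration; rouge_score actually applies the Porter stemmer.

```python
def overlap(ref_tokens, cand_tokens):
    """Number of distinct words the two token lists share."""
    return len(set(ref_tokens) & set(cand_tokens))

ref = "the fox jumps high".split()
gen = "a fox jump high".split()

raw = overlap(ref, gen)  # 'jumps' != 'jump', so only 'fox' and 'high' match

# Naive stemmer: strip a trailing 's' (illustration only, not Porter stemming)
stem = lambda w: w[:-1] if w.endswith("s") else w
stemmed = overlap([stem(w) for w in ref], [stem(w) for w in gen])

print(raw, stemmed)  # stemming lets 'jumps'/'jump' count as a match
```

Without stemming only two words match; after stemming, 'jumps' and 'jump' collapse to the same token and three words match.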
Summary
ROUGE measures how much a generated text matches a reference text.
It gives scores for precision, recall, and F1 to show quality.
Common ROUGE types are ROUGE-1 (words) and ROUGE-L (longest sequence).