What are ROUGE evaluation metrics in NLP?

Introduction

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) checks how good a machine-generated summary is by comparing it to a human-written reference summary. It measures how many words or phrases the two texts share.

Use it:

When you want to see how well a machine-made summary matches a human summary.

When testing different text summarization methods to find the best one.

When doing a quick quality check on chatbots or other text generators, as long as you have reference answers to compare against.

When comparing translations or paraphrases to a reference text.

When measuring whether a change to your text generation model improved its output.

Syntax

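from rouge_score import rouge_scorer

# Pick the ROUGE variants to compute; use_stemmer=True stems words before matching.
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
# Argument order matters: the reference comes first, the generated text second.
scores = scorer.score(reference_text, generated_text)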

You create a scorer object specifying which ROUGE types to use, like 'rouge1' or 'rougeL'.

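The score method takes the reference text first and the generated text second. It returns a dictionary keyed by ROUGE type, and each value holds three numbers: precision, recall, and fmeasure (the F1 score).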
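
To make the three numbers concrete, here is a small sketch in plain Python of how precision, recall, and F1 can be derived from overlap counts. It is an illustration, not the library's code: the helper `score_from_counts` and the example counts are made up for this page, and the `Score` tuple simply mirrors the three field names used in this tutorial.

```python
from collections import namedtuple

# Same three field names that the tutorial reads from each result.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])

def score_from_counts(overlap, generated_len, reference_len):
    """Hypothetical helper: turn raw counts into precision, recall and F1."""
    # Precision: share of the generated text that also appears in the reference.
    precision = overlap / generated_len if generated_len else 0.0
    # Recall: share of the reference that the generated text managed to cover.
    recall = overlap / reference_len if reference_len else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Score(precision, recall, f1)

# Made-up counts: 3 shared words, 4 generated words, 6 reference words.
example = score_from_counts(overlap=3, generated_len=4, reference_len=6)
print(f"P={example.precision:.2f} R={example.recall:.2f} F1={example.fmeasure:.2f}")
# P=0.75 R=0.50 F1=0.60
```

A short generated text can score high precision but low recall, which is why it helps to look at all three values together.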

Examples

This compares two simple sentences using ROUGE-1, which looks at overlapping single words.

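from rouge_score import rouge_scorer
# ROUGE-1 only: compare single words (unigrams), with stemming turned on.
scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
scores = scorer.score('The cat sat on the mat.', 'The cat is on the mat.')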
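
To see where a ROUGE-1 score comes from, the same two sentences can be scored by hand. The sketch below is a simplified stand-in, not the library's implementation: `tokenize` and `rouge1` are hypothetical helpers, the tokenizer just lowercases and keeps letters and digits, and stemming is skipped.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase, keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge1(reference, generated):
    """Hypothetical helper: unigram-overlap precision, recall and F1."""
    ref_counts = Counter(tokenize(reference))
    gen_counts = Counter(tokenize(generated))
    # Counter & Counter keeps the minimum count of each word, so a word
    # repeated in one text is only matched as often as it occurs in the other.
    overlap = sum((ref_counts & gen_counts).values())
    gen_total = sum(gen_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / gen_total if gen_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

p, r, f = rouge1("The cat sat on the mat.", "The cat is on the mat.")
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
# P=0.83 R=0.83 F1=0.83
```

Five of the six words match ("the" twice, "cat", "on", "mat"); only "sat" and "is" differ, which gives 5/6 for both precision and recall.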

This uses ROUGE-L, which is based on the longest common subsequence (LCS) of words in the two texts, with stemming turned off.

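from rouge_score import rouge_scorer
# ROUGE-L only: longest common subsequence of words, no stemming.
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=False)
scores = scorer.score('A quick brown fox.', 'A quick fox.')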
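
The longest common subsequence idea can also be worked out by hand for these two sentences. This is a minimal sketch using the textbook dynamic-programming approach, not the library's code: `tokenize`, `lcs_length`, and `rouge_l` are hypothetical helpers, and the tokenizer is a simplification.

```python
import re

def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase, keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] = LCS length of the first i tokens of a and first j tokens of b.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(reference, generated):
    """Hypothetical helper: LCS-based precision, recall and F1."""
    ref, gen = tokenize(reference), tokenize(generated)
    lcs = lcs_length(ref, gen)
    precision = lcs / len(gen) if gen else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

p, r, f = rouge_l("A quick brown fox.", "A quick fox.")
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
# P=1.00 R=0.75 F1=0.86
```

The shared subsequence is "a quick fox" (3 words). Every generated word is in it, so precision is 3/3, while the reference has 4 words, so recall is 3/4. The words do not have to be adjacent, only in the same order.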

Sample Program

This code compares a reference sentence and a generated sentence using ROUGE-1 and ROUGE-L with stemming. It prints precision, recall, and F1 scores for each metric.

from rouge_score import rouge_scorer

reference = "The quick brown fox jumps over the lazy dog."
generated = "A fast brown fox leaps over a lazy dog."

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, generated)

print(f"ROUGE-1: Precision={scores['rouge1'].precision:.2f}, Recall={scores['rouge1'].recall:.2f}, F1={scores['rouge1'].fmeasure:.2f}")
print(f"ROUGE-L: Precision={scores['rougeL'].precision:.2f}, Recall={scores['rougeL'].recall:.2f}, F1={scores['rougeL'].fmeasure:.2f}")

Output

ROUGE-1: Precision=0.56, Recall=0.56, F1=0.56
ROUGE-L: Precision=0.56, Recall=0.56, F1=0.56

Important Notes

ROUGE-1 counts overlapping single words between texts.

ROUGE-L looks at the longest common subsequence: the longest series of words that appear in both texts in the same order, even if other words sit between them.

Stemming helps match words with the same root, like 'jumps' and 'jump'.
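
The effect of stemming can be shown with a toy example. Everything here is an assumption made for illustration: `toy_stem` is a deliberately naive rule that only strips a trailing "s" and is not the stemmer the library uses, and `unigram_recall` is a hypothetical helper.

```python
import re
from collections import Counter

def toy_stem(token):
    # Naive stand-in for a real stemmer: drop a trailing "s" from longer words.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def unigram_recall(reference, generated, stem=False):
    """Hypothetical helper: share of reference words found in the generated text."""
    ref = re.findall(r"[a-z0-9]+", reference.lower())
    gen = re.findall(r"[a-z0-9]+", generated.lower())
    if stem:
        ref = [toy_stem(t) for t in ref]
        gen = [toy_stem(t) for t in gen]
    overlap = sum((Counter(ref) & Counter(gen)).values())
    return overlap / len(ref) if ref else 0.0

reference = "The fox jumps over the dog"
generated = "The fox jump over the dog"

recall_plain = unigram_recall(reference, generated)
recall_stemmed = unigram_recall(reference, generated, stem=True)
print(f"without stemming: {recall_plain:.2f}")   # without stemming: 0.83
print(f"with stemming:    {recall_stemmed:.2f}") # with stemming:    1.00
```

Without stemming, "jumps" and "jump" count as different words, so one of the six reference words goes unmatched. With stemming, both reduce to "jump" and every word matches.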

Summary

ROUGE measures how much a generated text matches a reference text.

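It reports precision (how much of the generated text appears in the reference), recall (how much of the reference appears in the generated text), and F1 (the harmonic mean of the two).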

Common ROUGE types are ROUGE-1 (overlap of single words) and ROUGE-L (longest common subsequence).