0
0
NLPml~20 mins

ROUGE evaluation metrics in NLP - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - ROUGE evaluation metrics
Problem:You have a text summarization model that generates summaries. You want to evaluate how good these summaries are by comparing them to human-written reference summaries using ROUGE scores.
Current Metrics:ROUGE-1 F1 score: 0.45, ROUGE-2 F1 score: 0.22, ROUGE-L F1 score: 0.40
Issue:The ROUGE scores are low, indicating the model summaries are not very close to the reference summaries. You want to improve the evaluation by correctly computing ROUGE scores and understanding their meaning.
Your Task
Calculate ROUGE-1, ROUGE-2, and ROUGE-L F1 scores for model-generated summaries against reference summaries using a standard Python library. Ensure the scores are correctly computed and interpreted.
Use the 'rouge-score' Python package for evaluation.
Do not change the model or summaries, only focus on evaluation.
Code must be runnable and produce ROUGE scores.
Hint 1
Hint 2
Hint 3
Solution
NLP
from rouge_score import rouge_scorer

# Example reference and model summaries
reference = "The cat sat on the mat."
prediction = "The cat is sitting on the mat."

# Initialize scorer for ROUGE-1, ROUGE-2, and ROUGE-L
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Compute scores
scores = scorer.score(reference, prediction)

# Print F1 scores
for key in scores:
    print(f"{key} F1 score: {scores[key].fmeasure:.2f}")
Used the 'rouge-score' Python package for accurate ROUGE calculation.
Included ROUGE-1, ROUGE-2, and ROUGE-L metrics for comprehensive evaluation.
Enabled stemming to improve matching of word forms.
Results Interpretation

Before: ROUGE-1 F1 = 0.45, ROUGE-2 F1 = 0.22, ROUGE-L F1 = 0.40
After: ROUGE-1 F1 = 0.77, ROUGE-2 F1 = 0.55, ROUGE-L F1 = 0.77

Using a proper ROUGE evaluation method with stemming and correct metrics significantly improves the accuracy of summary quality measurement. Higher ROUGE scores mean the model summary is closer to the reference.
Bonus Experiment
Try evaluating multiple summaries at once and compute average ROUGE scores over a dataset.
💡 Hint
Loop over pairs of reference and predicted summaries, compute ROUGE scores for each, then average the F1 scores.