
Evaluating generated text (BLEU, ROUGE) in NLP - ML Experiment: Train & Evaluate

Experiment - Evaluating generated text (BLEU, ROUGE)
Problem: You have a text generation model that produces summaries. You want to measure how good these summaries are compared to human-written references.
Current Metrics: BLEU score: 0.35, ROUGE-1 F1 score: 0.40
Issue: The scores are low, indicating the generated summaries are not very close to the references.
Your Task
Improve the evaluation by computing BLEU and ROUGE scores correctly and interpreting the results clearly.
Use the nltk library for BLEU calculation.
Use the rouge_score library for ROUGE calculation.
Do not change the generated or reference texts.
Provide runnable code that outputs BLEU and ROUGE scores.
Solution
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# Sample generated and reference summaries
reference = ["The cat sat on the mat."]
generated = "The cat is sitting on the mat."

# Tokenize
reference_tokens = [ref.split() for ref in reference]
generated_tokens = generated.split()

# BLEU score with smoothing
smooth = SmoothingFunction().method1
bleu_score = sentence_bleu(reference_tokens, generated_tokens, smoothing_function=smooth)

# ROUGE scores
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference[0], generated)

print(f"BLEU score: {bleu_score:.3f}")
print(f"ROUGE-1 F1 score: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-2 F1 score: {scores['rouge2'].fmeasure:.3f}")
print(f"ROUGE-L F1 score: {scores['rougeL'].fmeasure:.3f}")
Added proper tokenization of reference and generated texts.
Used smoothing function for BLEU to handle short sentences.
Calculated ROUGE-1, ROUGE-2, and ROUGE-L F1 scores using rouge_scorer.
Printed all scores with clear labels for easy interpretation.
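To see why the smoothing step matters, here is a minimal sketch comparing `sentence_bleu` on the same example pair with and without `SmoothingFunction().method1`. Without smoothing, the zero 4-gram overlap drives the geometric mean of n-gram precisions to effectively zero:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference_tokens = [["The", "cat", "sat", "on", "the", "mat."]]
generated_tokens = ["The", "cat", "is", "sitting", "on", "the", "mat."]

# Default (unsmoothed) BLEU: the 4-gram precision is 0, so the score
# collapses to essentially zero (nltk also emits a warning).
raw = sentence_bleu(reference_tokens, generated_tokens)

# method1 replaces zero n-gram counts with a small epsilon in the numerator,
# keeping the geometric mean finite for short sentences.
smooth = SmoothingFunction().method1
smoothed = sentence_bleu(reference_tokens, generated_tokens,
                         smoothing_function=smooth)

print(f"unsmoothed: {raw:.6f}, smoothed: {smoothed:.3f}")
```

This is why the solution applies smoothing: for sentence-level evaluation of short texts, unsmoothed BLEU is nearly always zero and carries no signal.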
Results Interpretation

Before: BLEU 0.35, ROUGE-1 0.40 (computed without smoothing or consistent tokenization, so the numbers were hard to trust)

After: for the example pair above, the corrected code reports roughly BLEU ≈ 0.21, ROUGE-1 ≈ 0.77, ROUGE-2 ≈ 0.55, ROUGE-L ≈ 0.77 (exact values may vary slightly across library versions)

Proper tokenization and smoothing make the BLEU calculation meaningful for short sentences: without smoothing, a single n-gram order with zero matches drives BLEU to effectively zero. BLEU stays well below the ROUGE scores here because it demands exact higher-order n-gram matches, while ROUGE-1 and ROUGE-L measure only unigram overlap and longest common subsequence. Reporting multiple ROUGE metrics alongside BLEU gives a fuller picture of text similarity and helps you judge generated text quality.
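ROUGE-1 F1 is simply the harmonic mean of unigram precision and recall. A small sketch computing it by hand for the example pair (skipping the lowercasing and stemming that `rouge_scorer` applies, which happen not to change the result for this pair):

```python
from collections import Counter

reference = "the cat sat on the mat".split()
generated = "the cat is sitting on the mat".split()

# Clipped unigram overlap: each word counts at most as often as it
# appears in the reference.
overlap = sum((Counter(reference) & Counter(generated)).values())

precision = overlap / len(generated)   # fraction of generated words found in reference
recall = overlap / len(reference)      # fraction of reference words found in generated
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```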
Bonus Experiment
Try evaluating multiple generated summaries against multiple references and compute average BLEU and ROUGE scores.
💡 Hint
Loop over pairs of generated and reference texts, accumulate scores, then average them.