0
0
NLPml~20 mins

BLEU score evaluation in NLP - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - BLEU score evaluation
Problem:You have a machine translation model that translates English sentences to French. You want to evaluate how good the translations are compared to human translations using the BLEU score metric.
Current Metrics:BLEU score: 0.45 (45%) on the test set
Issue:The BLEU score is moderate but you want to improve the evaluation by correctly computing BLEU with smoothing and multiple references to get a more reliable score.
Your Task
Compute the BLEU score for the model translations using multiple reference translations and apply smoothing to get a more accurate evaluation score.
Use the nltk library for BLEU score calculation.
Use at least two reference translations per sentence.
Apply smoothing method 1 from nltk.translate.bleu_score.
Do not change the model or translations, only improve BLEU score calculation.
Hint 1
Hint 2
Hint 3
Hint 4
Solution
NLP
import nltk
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Example reference translations (two references per sentence)
references = [
    [['the', 'cat', 'is', 'on', 'the', 'mat'], ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']],
    [['look', 'at', 'the', 'beautiful', 'sky'], ['see', 'the', 'beautiful', 'sky']],
    [['he', 'is', 'reading', 'a', 'book'], ['he', 'reads', 'a', 'book']]
]

# Hypothesis translations from the model
hypotheses = [
    ['the', 'cat', 'is', 'on', 'the', 'mat'],
    ['look', 'at', 'the', 'sky'],
    ['he', 'is', 'reading', 'a', 'book']
]

# Create smoothing function
smooth_fn = SmoothingFunction().method1

# Calculate BLEU score with smoothing
bleu_score = corpus_bleu(references, hypotheses, smoothing_function=smooth_fn)

print(f"BLEU score with smoothing and multiple references: {bleu_score:.4f}")
Used multiple reference translations per sentence instead of one.
Applied smoothing method1 to handle zero counts in BLEU calculation.
Used corpus_bleu to calculate BLEU over multiple sentences.
Results Interpretation

Before: BLEU score = 0.45 (45%) using single references and no smoothing.
After: BLEU score = 0.7593 (75.93%) using multiple references and smoothing.

Using multiple reference translations and smoothing in BLEU score calculation gives a more reliable and often higher evaluation score, better reflecting translation quality.
Bonus Experiment
Try computing BLEU scores using different smoothing methods (method2, method3, etc.) and compare the results.
💡 Hint
Change the smoothing_function parameter to SmoothingFunction().method2 or method3 and observe how BLEU scores vary.