BLEU score evaluation in NLP - ML Experiment: Train & Evaluate

Experiment - BLEU score evaluation

Problem: You have a machine translation model that translates English sentences into French. You want to measure how closely its output matches human translations using the BLEU score.

Current Metrics: BLEU score: 0.45 (45%) on the test set

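Issue: The current score was computed against a single reference per sentence and without smoothing. A single reference penalizes valid alternative phrasings, and without smoothing a missing higher-order n-gram match can drive the score to zero. Computing BLEU with multiple references and smoothing gives a more reliable score.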
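For reference, the standard BLEU definition makes it clear why zero counts are a problem. The symbols below (p_n, w_n, c, r, N) are notation introduced here for illustration and do not appear in the original exercise:

```latex
\mathrm{BLEU} = \mathrm{BP}\cdot\exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Here p_n is the modified n-gram precision (each hypothesis n-gram count is clipped to its maximum count in any single reference), c is the total hypothesis length, and r is the total length of the closest-length references. NLTK defaults to N = 4 with uniform weights w_n = 1/4. If any p_n is zero, its logarithm is undefined and the geometric mean collapses to zero. Smoothing method 1 avoids this by replacing a zero match count with a small epsilon (0.1 by default in NLTK).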

Your Task

Compute the BLEU score for the model translations using multiple reference translations and apply smoothing to get a more accurate evaluation score.

Use the nltk library for BLEU score calculation.

Use at least two reference translations per sentence.

Apply smoothing method 1 from nltk.translate.bleu_score.

Do not change the model or its translations; change only how the BLEU score is calculated.

Solution

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Example reference translations: one entry per sentence, each holding two
# tokenized references (English placeholder sentences for illustration)
references = [
    [['the', 'cat', 'is', 'on', 'the', 'mat'], ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']],
    [['look', 'at', 'the', 'beautiful', 'sky'], ['see', 'the', 'beautiful', 'sky']],
    [['he', 'is', 'reading', 'a', 'book'], ['he', 'reads', 'a', 'book']]
]

# Hypothesis translations from the model
hypotheses = [
    ['the', 'cat', 'is', 'on', 'the', 'mat'],
    ['look', 'at', 'the', 'sky'],
    ['he', 'is', 'reading', 'a', 'book']
]

# Create smoothing function
smooth_fn = SmoothingFunction().method1

# Calculate BLEU score with smoothing
bleu_score = corpus_bleu(references, hypotheses, smoothing_function=smooth_fn)

print(f"BLEU score with smoothing and multiple references: {bleu_score:.4f}")

Used multiple reference translations per sentence instead of one.

Applied smoothing method1, which replaces a zero n-gram match count with a small epsilon so that one missing n-gram order does not zero the whole score.

Used corpus_bleu to calculate BLEU over multiple sentences.
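The effect of smoothing is easiest to see at sentence level. The sketch below reuses the second example sentence from the solution, whose hypothesis shares no 4-gram with either reference. It assumes a reasonably recent NLTK release, where the unsmoothed score comes out as zero or a vanishingly small number and a warning is emitted.

```python
import warnings

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Second example sentence from the solution: the hypothesis has exactly one
# 4-gram, and it does not occur in either reference.
refs = [
    ['look', 'at', 'the', 'beautiful', 'sky'],
    ['see', 'the', 'beautiful', 'sky'],
]
hyp = ['look', 'at', 'the', 'sky']

# Without smoothing, the zero 4-gram precision collapses the score.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence NLTK's zero-overlap warning
    unsmoothed = sentence_bleu(refs, hyp)

# method1 replaces the zero match count with a small epsilon.
smoothed = sentence_bleu(
    refs, hyp, smoothing_function=SmoothingFunction().method1
)

print(f"no smoothing: {unsmoothed:.4f}")
print(f"method1:      {smoothed:.4f}")
```

At corpus level the n-gram counts are pooled over all sentences before the precisions are computed, so zero counts are rarer and smoothing matters less than it does for individual short sentences.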

Results Interpretation

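Before: BLEU score = 0.45 (45%) using a single reference per sentence and no smoothing.
After: BLEU score = 0.7593 (75.93%) reported using multiple references and smoothing.

The three-sentence snippet in the solution is illustrative only: on its own toy data it prints a score of about 0.91, so it does not reproduce either figure above.

Additional references give credit for valid alternative phrasings, so they usually raise the score. Smoothing keeps a single missing n-gram order from zeroing the result, which matters most for short texts. The higher number reflects a fairer measurement of the same translations, not a better model, so only compare BLEU scores that were computed with the same references, tokenization, and smoothing.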
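To see how much the reference set alone can move the score, the following sketch scores the solution's hypotheses twice: once against both references and once against a hypothetical single-reference setup that keeps only the second reference of each sentence. The choice of which reference to keep is an assumption made purely for illustration.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Same toy data as in the solution.
multi_refs = [
    [['the', 'cat', 'is', 'on', 'the', 'mat'],
     ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']],
    [['look', 'at', 'the', 'beautiful', 'sky'],
     ['see', 'the', 'beautiful', 'sky']],
    [['he', 'is', 'reading', 'a', 'book'],
     ['he', 'reads', 'a', 'book']],
]
hypotheses = [
    ['the', 'cat', 'is', 'on', 'the', 'mat'],
    ['look', 'at', 'the', 'sky'],
    ['he', 'is', 'reading', 'a', 'book'],
]

# Hypothetical single-reference setup: keep only the second reference.
single_refs = [[refs[1]] for refs in multi_refs]

smooth_fn = SmoothingFunction().method1
single = corpus_bleu(single_refs, hypotheses, smoothing_function=smooth_fn)
multi = corpus_bleu(multi_refs, hypotheses, smoothing_function=smooth_fn)

print(f"single reference: {single:.4f}")
print(f"two references:   {multi:.4f}")
```

The hypotheses are identical in both runs; only the references differ. The gap between the two numbers is a reminder that a BLEU score is meaningful only alongside the reference set it was computed against.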

Bonus Experiment

Try computing BLEU scores using different smoothing methods (method2, method3, etc.) and compare the results.

💡 Hint

Pass SmoothingFunction().method2 or SmoothingFunction().method3 as the smoothing_function argument and observe how the BLEU score changes.
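One possible way to run the comparison is to loop over the method names. This sketch scores a single short sentence, where the differences between methods are most visible. Using sentence_bleu instead of corpus_bleu is a choice made here for illustration, not part of the original exercise.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

refs = [
    ['look', 'at', 'the', 'beautiful', 'sky'],
    ['see', 'the', 'beautiful', 'sky'],
]
hyp = ['look', 'at', 'the', 'sky']

smoother = SmoothingFunction()
scores = {}
for name in ("method1", "method2", "method3"):
    # Look up the bound smoothing method by name on the same instance.
    smooth_fn = getattr(smoother, name)
    scores[name] = sentence_bleu(refs, hyp, smoothing_function=smooth_fn)
    print(f"{name}: {scores[name]:.4f}")
```

Each method handles the missing 4-gram match differently, so the three scores differ even though the hypothesis and references are unchanged. State which smoothing method was used whenever you report a score.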