
BLEU score evaluation in NLP

Introduction

BLEU (Bilingual Evaluation Understudy) score measures how good a machine's translation is by comparing it to one or more human reference translations. The closer the machine's output is to the human translations, the higher the score.

When you want to see how well a machine translated a sentence compared to a human translation.
When testing different translation models to pick the best one.
When improving a chatbot's language by checking its responses against correct answers.
When comparing summaries or paraphrases generated by a computer to original texts.
Syntax
from nltk.translate.bleu_score import sentence_bleu

reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate)
print(score)

The reference is a list of correct translations, where each translation is itself a list of tokens (words).

The candidate is the machine's translation, given as a single list of tokens.
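Because sentence_bleu works on token lists rather than raw strings, sentences must be tokenized first. A minimal sketch using plain str.split() as the tokenizer (real pipelines often use a proper tokenizer such as nltk.word_tokenize):

```python
from nltk.translate.bleu_score import sentence_bleu

# Raw sentences; str.split() acts as a simple whitespace tokenizer
reference_text = "this is a test"
candidate_text = "this is a test"

reference = [reference_text.split()]  # list of references, each a token list
candidate = candidate_text.split()    # candidate is a single token list

score = sentence_bleu(reference, candidate)
print(score)  # identical token lists give a score of 1.0
```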

Examples
Compare a candidate sentence with one reference sentence.
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
score = sentence_bleu(reference, candidate)
print(score)
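With the default settings, sentence_bleu averages 1- to 4-gram precisions. In the example above the candidate shares no 4-grams with the reference, so the geometric mean collapses toward zero and NLTK prints a warning. Either a smoothing function or lower-order weights avoids this; a sketch of both options:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Option 1: a smoothing function keeps zero n-gram counts
# from collapsing the whole score to (near) zero
smooth = SmoothingFunction().method1
score_smooth = sentence_bleu(reference, candidate, smoothing_function=smooth)

# Option 2: score only unigrams and bigrams via the weights parameter
score_bigram = sentence_bleu(reference, candidate, weights=(0.5, 0.5))

print(score_smooth, score_bigram)
```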
Compare candidate with multiple reference sentences.
references = [['this', 'is', 'a', 'test'], ['this', 'is', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(references, candidate)
print(score)
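For evaluating a whole test set, NLTK also provides corpus_bleu, which aggregates n-gram counts across all sentences before computing one score (not the same as averaging per-sentence scores). A sketch:

```python
from nltk.translate.bleu_score import corpus_bleu

# One list of references per candidate sentence
list_of_references = [
    [['this', 'is', 'a', 'test'], ['this', 'is', 'test']],
    [['the', 'cat', 'is', 'on', 'the', 'mat']],
]
candidates = [
    ['this', 'is', 'a', 'test'],
    ['the', 'cat', 'is', 'on', 'the', 'mat'],
]

score = corpus_bleu(list_of_references, candidates)
print(score)  # both candidates match a reference exactly, so the score is 1.0
```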
Sample Model

This program calculates the BLEU score between one human reference and one machine candidate sentence. It shows how close the machine's sentence is to the human's.

from nltk.translate.bleu_score import sentence_bleu

# One reference translation
reference = [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']]
# Candidate translation from machine
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

# Calculate BLEU score
score = sentence_bleu(reference, candidate)

print(f"BLEU score: {score:.4f}")
Important Notes

BLEU score ranges from 0 to 1, where 1 means a perfect match with a reference.

BLEU uses matching of small word groups (called n-grams) to compare sentences.

Candidates shorter than the reference are penalized, so they may score lower even when every word they contain is correct.
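The length effect comes from BLEU's brevity penalty: when the candidate is shorter than the reference, the score is multiplied by exp(1 - reference_length / candidate_length). A sketch that isolates the penalty by scoring unigrams only (via the standard weights parameter):

```python
import math
from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat']  # every word is correct, but the sentence is too short

# Unigram-only weights: precision is 2/2 = 1, so only the brevity penalty remains
score = sentence_bleu(reference, candidate, weights=(1.0,))

# Brevity penalty: exp(1 - 6/2) = exp(-2), roughly 0.135
print(score, math.exp(1 - 6 / 2))
```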

Summary

BLEU score measures how close a machine translation is to human translations.

It compares words and word groups between candidate and reference sentences.

Higher BLEU generally indicates closer agreement with the reference translations.