
Evaluating generated text (BLEU, ROUGE) in NLP

Introduction

BLEU and ROUGE are automatic metrics that score machine-generated text against human-written references by measuring word and phrase overlap. They give a quick, repeatable way to check whether a model's output matches what we expect. Common uses include:

Checking if a machine translation sounds like a real human translation.
Measuring how well a chatbot's reply matches a good answer.
Comparing summaries made by a computer to summaries written by people.
Evaluating text generated by AI for stories or articles.
Testing improvements in text generation models during training.
Syntax
Python
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

# BLEU score example
reference = [['this', 'is', 'a', 'test']]   # list of tokenized reference sentences
candidate = ['this', 'is', 'test']          # tokenized candidate sentence
# Note: very short sentences can score 0 with the default 4-gram weights
bleu_score = sentence_bleu(reference, candidate)

# ROUGE score example
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
# score(target, prediction): the reference text comes first, the candidate second
scores = scorer.score('this is a test', 'this is test')

BLEU is precision-oriented: it checks how many of the candidate's words and phrases (n-grams) also appear in the reference text.

ROUGE is recall-oriented: it checks how much of the reference is covered by overlapping words and sequences, which makes it a common choice for summaries.
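To make the precision/recall distinction concrete, here is a small pure-Python sketch (illustration only, no metric libraries) of the clipped unigram counting that underlies BLEU-1 precision and ROUGE-1 recall:

```python
from collections import Counter

def clipped_overlap(candidate, reference):
    """Count candidate tokens that also appear in the reference,
    clipping each token's count at its count in the reference."""
    cand, ref = Counter(candidate), Counter(reference)
    return sum(min(n, ref[tok]) for tok, n in cand.items())

candidate = 'the cat sat on the mat'.split()
reference = 'the cat is on the mat'.split()

overlap = clipped_overlap(candidate, reference)  # 5 shared tokens
precision = overlap / len(candidate)             # BLEU-1 flavor: 5/6
recall = overlap / len(reference)                # ROUGE-1 flavor: 5/6
print(f'precision={precision:.2f}, recall={recall:.2f}')
```

Here precision and recall happen to be equal because candidate and reference have the same length; with texts of different lengths the two numbers diverge, which is exactly where the BLEU/ROUGE distinction matters.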

Examples
Calculates BLEU score for a candidate sentence against one reference.
Python
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
# No 4-grams match, so the default 4-gram BLEU comes out 0 here
bleu = sentence_bleu(reference, candidate)
print(f'BLEU score: {bleu:.2f}')
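Because the candidate shares no 4-grams with the reference, the default score collapses to 0 (NLTK also emits a warning). A common workaround for short sentences, sketched here, is NLTK's SmoothingFunction, which assigns a small value to zero n-gram precisions:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# method1 replaces zero n-gram precisions with a small epsilon
smooth = SmoothingFunction().method1
bleu = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f'Smoothed BLEU score: {bleu:.2f}')
```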
Computes ROUGE-1 and ROUGE-L scores between two sentences.
Python
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
# score(target, prediction): the reference comes first, the candidate second
scores = scorer.score('the cat is on the mat', 'the cat sat on the mat')
print(scores)
Sample Model

This program compares a candidate sentence to a reference using both BLEU and ROUGE. Since the two texts are identical, every score comes out to a perfect 1.0.

Python
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

reference = [['the', 'quick', 'brown', 'fox']]
candidate = ['the', 'quick', 'brown', 'fox']

bleu = sentence_bleu(reference, candidate)

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score('the quick brown fox', 'the quick brown fox')

print(f'BLEU score: {bleu:.2f}')
print('ROUGE scores:', scores)
Important Notes

BLEU scores range from 0 to 1, where 1 means a perfect match.

ROUGE reports several numbers per variant (precision, recall, and F-measure); the F-measure balances precision and recall.
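The F-measure is simply the harmonic mean of precision and recall. Computing it by hand for the cat/mat example (pure Python, for illustration):

```python
def f_measure(precision, recall):
    # Harmonic mean of precision and recall
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# ROUGE-1 for 'the cat sat on the mat' vs. 'the cat is on the mat':
# 5 of 6 candidate tokens overlap, and 5 of 6 reference tokens are covered
precision, recall = 5 / 6, 5 / 6
print(f'F-measure: {f_measure(precision, recall):.2f}')
```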

Both metrics are more reliable with multiple reference texts, since there is rarely a single correct way to phrase an output.
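With sentence_bleu, multiple references are passed as a list of token lists, and the candidate gets credit for matching any of them. A sketch (the second, paraphrased reference is invented for illustration):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    ['the', 'cat', 'is', 'on', 'the', 'mat'],
    ['a', 'cat', 'sat', 'on', 'the', 'mat'],  # hypothetical paraphrase
]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
smooth = SmoothingFunction().method1

one_ref = sentence_bleu(references[:1], candidate, smoothing_function=smooth)
all_refs = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f'1 reference: {one_ref:.2f}, 2 references: {all_refs:.2f}')
```

Adding references can only help the candidate: n-gram matches are counted against the best-matching reference, so the score with both references is at least as high as with one.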

Summary

BLEU and ROUGE help measure how close generated text is to human text.

BLEU focuses on matching phrases; ROUGE focuses on overlapping words and sequences.

Use these scores to improve and compare text generation models.