What is Evaluating generated text (BLEU, ROUGE) in NLP?

Evaluating generated text (BLEU, ROUGE) in NLP

Introduction

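BLEU and ROUGE are automatic metrics that score machine-generated text by how much it overlaps with one or more human-written reference texts. They measure surface overlap of words and phrases, not whether the text actually makes sense, so treat them as a rough proxy for quality.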

Checking if a machine translation sounds like a real human translation.

Measuring how well a chatbot's reply matches a good answer.

Comparing summaries made by a computer to summaries written by people.

Evaluating text generated by AI for stories or articles.

Testing improvements in text generation models during training.

Syntax

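from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# BLEU: sentence_bleu(list_of_reference_token_lists, candidate_tokens)
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'test']
# The 3-token candidate has no 4-grams, so smooth the default 4-gram BLEU
# to avoid a warning and a near-zero score
smooth = SmoothingFunction().method1
bleu_score = sentence_bleu(reference, candidate, smoothing_function=smooth)

# ROUGE: scorer.score(target, prediction), so the reference comes first
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score('this is a test', 'this is test')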

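BLEU is precision-oriented: it counts how many of the candidate's n-grams (by default sequences of 1 to 4 words) also appear in the reference, and applies a brevity penalty to candidates that are shorter than the reference.

ROUGE is recall-oriented: ROUGE-1 counts overlapping single words and ROUGE-L uses the longest common subsequence of the two texts. It is most often used for summaries.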
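To make the BLEU description concrete, here is a minimal pure-Python sketch of its two ingredients, clipped n-gram precision and the brevity penalty, using the same sentences as the snippet above. The helper names (`ngrams`, `clipped_precision`, `brevity_penalty`) are illustrative assumptions and not part of nltk; the sketch handles a single reference and only combines 1-grams and 2-grams, so its result will not equal nltk's default 4-gram score.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, candidate, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def brevity_penalty(reference, candidate):
    """Penalize candidates shorter than the reference; 1.0 otherwise."""
    if not candidate:
        return 0.0
    if len(candidate) >= len(reference):
        return 1.0
    return math.exp(1 - len(reference) / len(candidate))


reference = ['this', 'is', 'a', 'test']
candidate = ['this', 'is', 'test']

p1 = clipped_precision(reference, candidate, 1)  # 3 of 3 unigrams match -> 1.0
p2 = clipped_precision(reference, candidate, 2)  # 1 of 2 bigrams match -> 0.5
bp = brevity_penalty(reference, candidate)       # exp(1 - 4/3), about 0.72

# Geometric mean of the two precisions, scaled by the brevity penalty (a "BLEU-2")
bleu2 = bp * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))
print(f'p1={p1:.2f} p2={p2:.2f} BP={bp:.2f} BLEU-2={bleu2:.2f}')
```

Every word of the candidate appears in the reference, so unigram precision is perfect; the score drops because only one of the two bigrams matches and the candidate is one word short.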

Examples

Calculates BLEU score for a candidate sentence against one reference.

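from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
# No 4-gram of the candidate appears in the reference, so unsmoothed 4-gram BLEU collapses to about 0
bleu = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(f'BLEU score: {bleu:.2f}')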

Computes ROUGE-1 and ROUGE-L scores between two sentences.

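from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
# score(target, prediction): the reference comes first, the candidate second
scores = scorer.score('the cat is on the mat', 'the cat sat on the mat')
print(scores)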
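What ROUGE-1 and ROUGE-L measure for this sentence pair can be sketched by hand in pure Python. The functions below (`prf`, `rouge1`, `lcs_length`, `rouge_l`) are hypothetical helpers written for illustration, not the rouge_score API, and they assume pre-tokenized input with no stemming or other normalization, so results can differ from the library on other inputs.

```python
from collections import Counter


def prf(matches, ref_len, cand_len):
    """Turn a match count into (precision, recall, F-measure)."""
    precision = matches / cand_len if cand_len else 0.0
    recall = matches / ref_len if ref_len else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def rouge1(reference, candidate):
    """ROUGE-1: overlap of single words, counted with multiplicity."""
    overlap = Counter(reference) & Counter(candidate)  # per-word minimum count
    return prf(sum(overlap.values()), len(reference), len(candidate))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L: longest common subsequence relative to each text's length."""
    return prf(lcs_length(reference, candidate), len(reference), len(candidate))


reference = 'the cat is on the mat'.split()
candidate = 'the cat sat on the mat'.split()

r1 = rouge1(reference, candidate)    # 5 shared words out of 6 on each side
rl = rouge_l(reference, candidate)   # LCS is "the cat on the mat", length 5
print('ROUGE-1 (P, R, F):', r1)
print('ROUGE-L (P, R, F):', rl)
```

Both texts have six tokens and share five of them in the same order, so precision, recall, and F-measure all come out to 5/6 for both variants.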

Sample Model

This program scores a candidate sentence against a reference with both BLEU and ROUGE. Because the two texts are identical, every score is the maximum value of 1.0.

from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

reference = [['the', 'quick', 'brown', 'fox']]
candidate = ['the', 'quick', 'brown', 'fox']

bleu = sentence_bleu(reference, candidate)

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score('the quick brown fox', 'the quick brown fox')

print(f'BLEU score: {bleu:.2f}')
print('ROUGE scores:', scores)

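Output

BLEU score: 1.00
ROUGE scores: {'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}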

Important Notes

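BLEU scores range from 0 to 1, where 1 means the candidate matches a reference exactly; they are often reported multiplied by 100.

ROUGE reports precision, recall, and F-measure for each variant; the F-measure is the harmonic mean of precision and recall.

Both metrics are fairer with multiple reference texts, because a good output can be worded in more than one valid way.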
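The effect of extra references on BLEU-style matching can be sketched in pure Python. The helper names and the example sentences below are assumptions made for illustration; the rule shown, crediting each candidate n-gram up to its maximum count in any single reference, is how BLEU's clipped precision treats multiple references.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(references, candidate, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    cand = ngram_counts(candidate, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        # Counter union keeps, for each n-gram, the largest count seen in any reference
        max_ref |= ngram_counts(ref, n)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / total


ref_a = 'the cat is on the mat'.split()
ref_b = 'there is a cat on the mat'.split()
candidate = 'a cat is on the mat'.split()

only_a = clipped_precision([ref_a], candidate, 2)         # 4 of 5 bigrams match
only_b = clipped_precision([ref_b], candidate, 2)         # 3 of 5 bigrams match
both = clipped_precision([ref_a, ref_b], candidate, 2)    # all 5 bigrams match
print(f'ref A only: {only_a:.2f}, ref B only: {only_b:.2f}, both: {both:.2f}')
```

The candidate is a reasonable sentence, yet neither reference alone covers all of its bigrams. With both references available, every bigram is accounted for, which is why a single reference tends to under-score valid rewordings.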

Summary

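BLEU and ROUGE measure how closely generated text overlaps with human-written reference text.

BLEU is precision-oriented and matches n-grams with a brevity penalty; ROUGE is recall-oriented and matches words (ROUGE-1) and longest common subsequences (ROUGE-L).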

Use these scores to improve and compare text generation models.

Practice

1. What is the main purpose of BLEU and ROUGE scores in evaluating generated text?

easy

A. To measure how similar the generated text is to human-written text

B. To check the spelling errors in generated text

C. To count the number of words in the generated text

D. To translate text from one language to another

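Solution

Step 1: Understand the role of BLEU and ROUGE. Both metrics compare generated text with human-written reference text by counting overlapping words and phrases.

Step 2: Identify the main purpose. They do not check spelling, count words, or translate text; they measure similarity to the reference.

Final Answer: A. To measure how similar the generated text is to human-written text.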
