Evaluating generated text (BLEU, ROUGE) in NLP - Model Pipeline Trace

Model Pipeline - Evaluating generated text (BLEU, ROUGE)

This pipeline shows how to check the quality of text generated by a model. Each generated text is compared with a reference text (a known-good example) using two overlap-based scores, BLEU and ROUGE.

Data Flow - 5 Stages
1. Input Texts
Input: 1000 pairs of sentences
Step: Collect pairs of generated text and reference text
Output: 1000 pairs of sentences
Example: Generated: 'The cat sat on the mat.' Reference: 'A cat is sitting on the mat.'

2. Tokenization
Input: 1000 pairs of sentences
Step: Split sentences into words (tokens)
Output: 1000 pairs of token lists
Example: ['The', 'cat', 'sat', 'on', 'the', 'mat'] and ['A', 'cat', 'is', 'sitting', 'on', 'the', 'mat']

3. Calculate BLEU Score
Input: 1000 pairs of token lists
Step: Compare n-grams between generated and reference texts to get BLEU score
Output: 1000 BLEU scores (0 to 1)
Example: BLEU score: 0.65

4. Calculate ROUGE Score
Input: 1000 pairs of token lists
Step: Compare overlapping units like n-grams and longest common subsequence for ROUGE scores
Output: 1000 ROUGE scores (Recall, Precision, F1)
Example: ROUGE-L F1 score: 0.70

5. Aggregate Scores
Input: 1000 BLEU and ROUGE scores
Step: Calculate average scores to summarize model performance
Output: Average BLEU and ROUGE scores
Example: Average BLEU: 0.62, Average ROUGE-L F1: 0.68
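
The tokenization, BLEU, and aggregation stages can be sketched in plain Python. This is a minimal illustration, not the implementation behind the numbers above: the regex tokenizer, the add-one smoothing on higher-order n-grams, the single-reference setup, and the second sentence pair are all assumptions made for the example. Real scores depend on these choices, so the output will not necessarily match the figures quoted in the stages.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a sentence into word tokens, dropping punctuation (assumed scheme)."""
    return re.findall(r"\w+", text)


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference.

    Assumption: add-one smoothing is applied for n > 1 so that a short
    sentence with no 4-gram match does not collapse to zero.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clipped matches: an n-gram is credited at most as many times
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if n > 1:
            matches, total = matches + 1, total + 1
        if matches == 0:
            return 0.0  # no unigram overlap at all
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


pairs = [
    ("The cat sat on the mat.", "A cat is sitting on the mat."),
    ("A dog barked loudly.", "A dog barked loudly."),  # hypothetical extra pair
]
scores = [sentence_bleu(tokenize(gen), tokenize(ref)) for gen, ref in pairs]
average_bleu = sum(scores) / len(scores)
print("Per-pair BLEU:", [round(s, 3) for s in scores])
print("Average BLEU:", round(average_bleu, 3))
```

The tokenizer keeps the original casing so that its output matches the token lists shown in stage 2; lowercasing before matching is a common alternative. BLEU is also often reported at corpus level, pooling n-gram counts over all pairs, which generally gives a different number from the per-sentence average used here.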
Training Trace - Epoch by Epoch

[Chart: loss (0.0 to 1.0) plotted against epochs 1 to 5]
Epoch | Loss ↓ | Accuracy ↑ | Observation
1 | 0.85 | 0.40 | Initial evaluation shows low BLEU and ROUGE scores, indicating poor text quality.
2 | 0.65 | 0.55 | Scores improve as the model learns to generate more similar text.
3 | 0.50 | 0.68 | Better matching of n-grams and sequences is reflected in higher BLEU and ROUGE.
4 | 0.40 | 0.75 | Model generates more fluent and relevant text; scores continue to rise.
5 | 0.35 | 0.80 | Training converges with good BLEU and ROUGE scores, showing quality text generation.
Prediction Trace - 5 Layers
Layer 1: Input Generated Sentence
Layer 2: Input Reference Sentence
Layer 3: Calculate BLEU Score
Layer 4: Calculate ROUGE Score
Layer 5: Final Evaluation
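
As one concrete instance of the ROUGE step in this trace, ROUGE-L for a single sentence pair can be sketched with a longest-common-subsequence computation. The sketch assumes case-sensitive token matching, no stemming, and the balanced F1 form of the F-measure; the function names are illustrative. Results depend on those choices, so they need not match figures quoted elsewhere on this page.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return ROUGE-L (precision, recall, F1) for one candidate/reference pair."""
    if not candidate or not reference:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)  # share of generated tokens in the LCS
    recall = lcs / len(reference)     # share of reference tokens in the LCS
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


generated = ["The", "cat", "sat", "on", "the", "mat"]
reference = ["A", "cat", "is", "sitting", "on", "the", "mat"]
precision, recall, f1 = rouge_l(generated, reference)
print(f"ROUGE-L precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
# prints: ROUGE-L precision=0.667 recall=0.571 F1=0.615
```

Here the longest common subsequence is "cat on the mat" (4 tokens), giving precision 4/6 and recall 4/7.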
Model Quiz - 3 Questions
Test your understanding
What does the BLEU score mainly measure in generated text?
A. The length of the generated text
B. Matching groups of words (n-grams) with the reference text
C. The number of unique words in the generated text
D. The grammatical correctness of the generated text
Key Insight
BLEU and ROUGE measure how closely generated text overlaps with a reference. Tracking these scores across training epochs shows whether the model's output is moving closer to the reference, which usually goes along with more accurate and fluent text.