BLEU score evaluation in NLP - Model Pipeline Trace

Model Pipeline - BLEU score evaluation

This pipeline evaluates how well a machine translation model translates sentences by comparing its output to human translations using the BLEU score. BLEU measures similarity by counting the words and short phrases (n-grams) that the output shares with the references.

Data Flow - 5 Stages
1. Input Sentences
   Input: 100 sentences
   Operation: Collect source sentences and their human reference translations
   Output: 100 sentences with references
   Example: Source: 'The cat sits on the mat.' Reference: 'The cat is sitting on the mat.'
2. Model Translation
   Input: 100 source sentences
   Operation: Translate source sentences using the machine translation model
   Output: 100 translated sentences
   Example: Model output: 'The cat sits on the mat.'
3. Tokenization
   Input: 100 translated sentences and 100 reference sentences
   Operation: Split sentences into words (tokens) for comparison
   Output: 100 tokenized translations and 100 tokenized references
   Example: ['The', 'cat', 'sits', 'on', 'the', 'mat']
4. N-gram Matching
   Input: Tokenized translations and references
   Operation: Count matching word groups (n-grams) between translation and references
   Output: Counts of matching n-grams for each sentence
   Example: Matching bigrams: ['The cat', 'on the', 'the mat']
5. BLEU Score Calculation
   Input: N-gram counts and sentence lengths
   Operation: Calculate BLEU score using precision of n-grams and brevity penalty
   Output: Single BLEU score value between 0 and 1
   Example: BLEU score: 0.72
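Stages 3 to 5 can be sketched in plain Python for the single example pair above. This is a minimal illustration, not the pipeline's actual implementation: the function names (`ngrams`, `clipped_precision`, `simple_bleu`) are hypothetical, it assumes whitespace tokenization with punctuation already removed, one reference per sentence, uniform weights over unigrams and bigrams, and no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Credit each candidate n-gram at most as often as it occurs in the reference.
    matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matches / total if total else 0.0


def simple_bleu(candidate, reference, max_n=2):
    """Unsmoothed single-reference BLEU with uniform n-gram weights (a sketch)."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    # Brevity penalty: penalise candidates that are shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty * math.exp(log_mean)


# Stage 3: tokenization (assumed: split on whitespace, punctuation dropped)
candidate = "The cat sits on the mat".split()
reference = "The cat is sitting on the mat".split()

# Stage 4: bigrams of the candidate that also occur in the reference
reference_bigrams = set(ngrams(reference, 2))
matching_bigrams = [g for g in ngrams(candidate, 2) if g in reference_bigrams]

# Stage 5: BLEU over unigrams and bigrams
score = simple_bleu(candidate, reference, max_n=2)
print(matching_bigrams)
print(round(score, 3))
```

The sketch scores one sentence pair only, so its value is not expected to match the illustrative 0.72 shown for the pipeline. With the default of four n-gram orders, this pair would score 0 without smoothing because no 4-gram matches, which is one reason BLEU is usually reported over a whole corpus.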
Training Trace - Epoch by Epoch
Loss: 0.85 |****     
Loss: 0.65 |******   
Loss: 0.50 |******** 
Loss: 0.40 |*********
Loss: 0.35 |*********
Epoch | Loss ↓ | Accuracy ↑ | Observation
1 | 0.85 | 0.40 | Initial training with high loss and low accuracy
2 | 0.65 | 0.55 | Loss decreased, accuracy improved
3 | 0.50 | 0.65 | Model learning better translations
4 | 0.40 | 0.72 | Continued improvement in translation quality
5 | 0.35 | 0.78 | Training converging with good accuracy
Prediction Trace - 5 Layers
Layer 1: Input Sentence
Layer 2: Model Translation
Layer 3: Tokenization
Layer 4: N-gram Matching
Layer 5: BLEU Score Calculation
Model Quiz - 3 Questions
Test your understanding
What does the BLEU score measure in this pipeline?
A. How similar the model translation is to human references
B. How fast the model translates sentences
C. The number of words in the source sentence
D. The length of the translated sentence
Key Insight
The BLEU score measures how close a machine translation is to human reference translations by counting matching words and phrases (n-grams). During training, loss decreases and accuracy improves as the model learns, which usually leads to better BLEU scores, although lower loss does not guarantee a higher BLEU.

Practice

1. What does the BLEU score primarily measure in machine translation?
easy
A. How close the machine translation is to human translations
B. The speed of the translation process
C. The number of words in the translated sentence
D. The grammatical correctness of the translation

Solution

  1. Step 1: Understand BLEU score purpose

    BLEU score is designed to compare machine translations to human reference translations.
  2. Step 2: Identify what BLEU measures

    It measures overlap in words and phrases with the references, not speed or grammatical correctness.
  3. Final Answer:

    How close the machine translation is to human translations -> Option A
  4. Quick Check:

    BLEU = similarity to human translations [OK]
Hint: BLEU = closeness to human translation quality [OK]
Common Mistakes:
  • Confusing BLEU with translation speed
  • Thinking BLEU measures grammar correctness
  • Assuming BLEU counts total words only
2. Which of the following is the correct way to calculate the BLEU score using NLTK in Python?
easy
A. bleu_score = nltk.bleu_score(candidate, [reference])
B. bleu_score = nltk.translate.bleu_score.sentence_bleu([reference], candidate)
C. bleu_score = nltk.translate.bleu_score(candidate, reference)
D. bleu_score = nltk.translate.bleu_score.score(candidate, reference)

Solution

  1. Step 1: Recall NLTK BLEU function syntax

    The correct function is sentence_bleu and it takes a list of references and a candidate sentence.
  2. Step 2: Match correct argument order

    The references come first and must be a list of token lists; the candidate comes second and is a single list of tokens.
  3. Final Answer:

    bleu_score = nltk.translate.bleu_score.sentence_bleu([reference], candidate) -> Option B
  4. Quick Check:

    Use sentence_bleu([ref], cand) syntax [OK]
Hint: Use sentence_bleu with references as list of lists [OK]
Common Mistakes:
  • Passing candidate before reference
  • Not wrapping reference in a list
  • Using incorrect function names
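As a hedged illustration of the call in Option B, the snippet below passes the references first, wrapped in a list, and the candidate second. The sentences are example data chosen here, and `weights=(0.5, 0.5)` is an optional setting that restricts scoring to unigrams and bigrams. The import is guarded on the assumption that NLTK may not be installed.

```python
reference = ["the", "cat", "is", "on", "the", "mat"]
candidate = ["the", "cat", "sat", "on", "the", "mat"]

# References must be a list of token lists, even when there is only one reference.
references = [reference]

try:
    from nltk.translate.bleu_score import sentence_bleu
except ImportError:
    sentence_bleu = None  # NLTK is not available in this environment

score = None
if sentence_bleu is not None:
    # Argument order: references first, candidate (hypothesis) second.
    score = sentence_bleu(references, candidate, weights=(0.5, 0.5))
    print(f"BLEU-2: {score:.3f}")
else:
    print("NLTK is not installed; install it to run the scoring step.")
```

Calling `sentence_bleu` with its default weights scores up to 4-grams, so short sentences with no matching 4-gram can score close to zero unless a smoothing function is supplied.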
3. Given the candidate sentence ["the", "cat", "is", "on", "the", "mat"] and reference sentence ["there", "is", "a", "cat", "on", "the", "mat"], what is the approximate BLEU score (unigram precision only)?
medium
A. 0.83
B. 0.50
C. 0.67
D. 0.33

Solution

  1. Step 1: Calculate unigram matches

    Candidate words: the, cat, is, on, the, mat
    Reference words: there, is, a, cat, on, the, mat
    Words found in both: the, cat, is, on, mat
  2. Step 2: Clip counts to the reference maximum

    'the' appears twice in the candidate but only once in the reference, so only one 'the' counts.
    Clipped matches: 'the' once, 'cat' once, 'is' once, 'on' once, 'mat' once = 5 matches
  3. Step 3: Compute unigram precision

    Candidate length = 6, so precision = 5/6 ≈ 0.83. No brevity penalty is applied because the question asks for unigram precision only.
  4. Final Answer:

    0.83 -> Option A
  5. Quick Check:

    Unigram precision = 5/6 = 0.83 [OK]
Hint: Count max reference word matches for unigram precision [OK]
Common Mistakes:
  • Counting repeated words more than reference max
  • Confusing unigram with bigram precision
  • Ignoring max count clipping
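The clipping step in this solution can be checked with a few lines of plain Python. This is a small sketch using `collections.Counter`; the variable names are illustrative and not part of any library.

```python
from collections import Counter

candidate = ["the", "cat", "is", "on", "the", "mat"]
reference = ["there", "is", "a", "cat", "on", "the", "mat"]

cand_counts = Counter(candidate)
ref_counts = Counter(reference)

# Clip each candidate word count at the number of times it occurs in the reference.
clipped = {word: min(count, ref_counts[word]) for word, count in cand_counts.items()}
matches = sum(clipped.values())
precision = matches / len(candidate)

# Without clipping, every candidate token found in the reference would count.
unclipped_matches = sum(1 for word in candidate if word in ref_counts)

print(matches, round(precision, 2))  # 5 0.83
print(unclipped_matches)             # 6
```

Without clipping, the repeated 'the' would be counted twice and the precision would be 6/6, which is why BLEU uses clipped (modified) precision.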
4. Identify the error in this BLEU score calculation code snippet:
from nltk.translate.bleu_score import sentence_bleu
reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
score = sentence_bleu(reference, candidate)
print(score)
medium
A. Candidate should be a string, not a list
B. Missing import for nltk
C. Reference should be a list of lists, not a single list
D. sentence_bleu requires lowercase strings only

Solution

  1. Step 1: Check sentence_bleu input format

    sentence_bleu expects references as a list of reference sentences (each a list of tokens), so reference must be wrapped in another list.
  2. Step 2: Identify the error in code

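    The reference is passed as a flat list of tokens, not a list of lists, so each token string is treated as a separate reference. This typically produces a wrong (near-zero) score rather than an exception.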
  3. Final Answer:

    Reference should be a list of lists, not a single list -> Option C
  4. Quick Check:

    References = list of lists [OK]
Hint: Wrap reference in a list for sentence_bleu [OK]
Common Mistakes:
  • Passing reference as a flat list
  • Passing candidate as string instead of list
  • Ignoring input format requirements
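Because a flat reference list may not fail loudly, one option is to validate the input shape before scoring. The helper below is a hypothetical sketch (`check_references` is not an NLTK function) that rejects a flat list of tokens and accepts a list of token lists.

```python
def check_references(references):
    """Return references unchanged if it is a non-empty list of token lists.

    Hypothetical helper for illustration; raises ValueError otherwise.
    """
    if isinstance(references, str) or not references:
        raise ValueError("references must be a non-empty list of token lists")
    for ref in references:
        # A string here means the caller passed a flat list of tokens.
        if isinstance(ref, str):
            raise ValueError("wrap the reference in a list, e.g. [reference]")
        if not all(isinstance(token, str) for token in ref):
            raise ValueError("each reference must be a list of string tokens")
    return references


reference = ["the", "cat", "is", "on", "the", "mat"]

checked = check_references([reference])  # accepted: a list of token lists

try:
    check_references(reference)          # rejected: a flat list of tokens
except ValueError as error:
    print(f"Rejected: {error}")
```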
5. You have two reference translations:
ref1 = ['the', 'cat', 'is', 'on', 'the', 'mat']
ref2 = ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']
And a candidate translation:
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
How should you prepare the references to correctly compute the BLEU score considering multiple references?
hard
A. Pass references as separate calls to sentence_bleu
B. Concatenate ref1 and ref2 into a single list and pass as one reference
C. Pass only the reference closest in length to candidate
D. Pass references as a list containing both ref1 and ref2 lists

Solution

  1. Step 1: Understand multiple references in BLEU

    BLEU supports multiple references by passing a list of reference sentences (each a list of tokens).
  2. Step 2: Prepare references correctly

    References should be passed as [ref1, ref2], a list containing both reference lists.
  3. Step 3: Avoid incorrect methods

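    Concatenating the references creates n-grams that span the join and distorts the reference length. Scoring each reference in a separate call gives per-reference scores, not a single multi-reference score.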
  4. Final Answer:

    Pass references as a list containing both ref1 and ref2 lists -> Option D
  5. Quick Check:

    Multiple references = list of reference lists [OK]
Hint: Use list of reference lists for multiple references [OK]
Common Mistakes:
  • Concatenating references into one list
  • Passing references separately in multiple calls
  • Using only one reference when multiple exist
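To see why the references are passed together, the sketch below computes clipped unigram precision by hand, taking for each word the maximum count found in any reference. The function name `clipped_unigram_precision` is illustrative and not part of NLTK.

```python
from collections import Counter

ref1 = ["the", "cat", "is", "on", "the", "mat"]
ref2 = ["there", "is", "a", "cat", "on", "the", "mat"]
candidate = ["the", "cat", "sat", "on", "the", "mat"]


def clipped_unigram_precision(candidate, references):
    """Clipped unigram precision against one or more references (a sketch)."""
    cand_counts = Counter(candidate)
    ref_counters = [Counter(ref) for ref in references]
    matches = 0
    for word, count in cand_counts.items():
        # A word may be credited up to its highest count in any single reference.
        max_ref_count = max(counter[word] for counter in ref_counters)
        matches += min(count, max_ref_count)
    return matches / len(candidate)


both = clipped_unigram_precision(candidate, [ref1, ref2])
only_ref2 = clipped_unigram_precision(candidate, [ref2])

print(round(both, 2))       # 0.83
print(round(only_ref2, 2))  # 0.67
```

With both references, the two occurrences of 'the' are credited because ref1 contains 'the' twice. Against ref2 alone, only one 'the' counts, so the precision drops from 5/6 to 4/6.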