
Beam search decoding in NLP - Model Metrics & Evaluation

Metrics & Evaluation - Beam search decoding
Which metric matters for Beam Search Decoding and WHY

Beam search is used in sequence generation tasks such as machine translation and text generation. Its goal is to find a high-scoring sequence of words under the model, so the metrics that matter most measure the quality of the generated sequences. These include BLEU score (n-gram overlap between the output and one or more references), sequence accuracy (exact match rate), and perplexity (how well the model predicts the next word). Perplexity describes the model itself and does not change with the decoding method, so BLEU and sequence accuracy are the metrics that show whether beam search finds better sentences than simpler methods such as greedy decoding.
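As a rough illustration of two of these metrics, the sketch below computes sequence accuracy and perplexity by hand. The function names and the sample sentences are made up for this example, and perplexity is assumed to use natural-log token probabilities. BLEU is left out because it is normally computed with an existing evaluation library rather than reimplemented.

```python
import math

def sequence_accuracy(hypotheses, references):
    """Fraction of generated outputs that match their reference exactly."""
    matches = sum(h == r for h, r in zip(hypotheses, references))
    return matches / len(references)

def perplexity(token_log_probs):
    """exp of the average negative log-probability per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Illustrative data only: two of the three outputs match their reference.
hyps = ["the cat sat", "a dog barked", "birds fly"]
refs = ["the cat sat", "the dog barked", "birds fly"]
acc = sequence_accuracy(hyps, refs)

# Two tokens predicted with probabilities 0.5 and 0.25.
ppl = perplexity([math.log(0.5), math.log(0.25)])
print(f"sequence accuracy = {acc:.3f}, perplexity = {ppl:.3f}")
```

Lower perplexity means the model assigned higher probability to the tokens it was scored on; it says nothing about which decoding method produced the output.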

Confusion Matrix or Equivalent Visualization

Beam search decoding does not produce a confusion matrix the way a classifier does. Instead, we can visualize the beam paths and their scores. For example, if beam width is 3, the top 3 partial sequences are kept at each step. The scores show how likely each sequence is. A simple ASCII example covering two steps:

Step 1: ["I" (score=0.9), "It" (0.7), "In" (0.5)]
Step 2: 
  "I am" (0.8), "I is" (0.6), "I was" (0.5)
  "It is" (0.7), "It was" (0.4), "It can" (0.3)
  "In the" (0.5), "In a" (0.4), "In my" (0.3)
Top 3 sequences kept based on scores: "I am" (0.8), "It is" (0.7), "I is" (0.6).
    

This shows how beam search narrows down the best sequences step by step.
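The pruning at step 2 can be sketched in a few lines. This is only an illustration of the "keep the top 3" rule, using the candidate scores from the example above; the variable names are made up here.

```python
import heapq

# Step-2 candidates and scores copied from the ASCII example.
candidates = {
    "I am": 0.8, "I is": 0.6, "I was": 0.5,
    "It is": 0.7, "It was": 0.4, "It can": 0.3,
    "In the": 0.5, "In a": 0.4, "In my": 0.3,
}
beam_width = 3

# Keep only the beam_width highest-scoring partial sequences.
kept = heapq.nlargest(beam_width, candidates.items(), key=lambda item: item[1])
print(kept)
```

Note that every sequence starting with "In" is dropped at this step, so no later step can extend it.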

Precision vs Recall Tradeoff (or Equivalent) with Concrete Examples

Beam search balances exploring many possible sequences (analogous to high recall over candidate outputs) against focusing on only the best ones (analogous to high precision). A small beam width keeps fewer sequences, so it is fast but might miss the best sentence (low recall). A large beam width keeps many sequences, increasing the chance of finding the best output (high recall), but it is slower and carries along many poor candidates (lower precision).

Example: In machine translation, a beam width of 1 (greedy search) might miss a better translation. A beam width of 10 often finds better sentences but takes more time. Choosing beam width is a tradeoff between quality and speed.
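The following minimal sketch shows how a beam width of 1 can miss a sequence that a wider beam finds. `TOY_MODEL` is a hypothetical lookup table of next-token probabilities invented for this example, not a real model, and `beam_search` is one simple way to write the algorithm using cumulative log-probabilities.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, beam_width, max_len):
    """Return the surviving (sequence, log_prob) pairs, best first."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # Extend every surviving sequence by every possible next token.
            for token, prob in model.get(seq, {}).items():
                candidates.append((seq + (token,), score + math.log(prob)))
        if not candidates:
            break
        # Prune: keep only the beam_width best candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy = beam_search(TOY_MODEL, beam_width=1, max_len=2)[0]
wider = beam_search(TOY_MODEL, beam_width=2, max_len=2)[0]
print(greedy[0], round(math.exp(greedy[1]), 2))
print(wider[0], round(math.exp(wider[1]), 2))
```

With width 1 the search commits to "a" because it has the higher first-step probability, and ends with a sequence of probability 0.33. With width 2 it also keeps "b", whose best continuation reaches 0.36.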

What "Good" vs "Bad" Metric Values Look Like for Beam Search Decoding

Good: High BLEU score (close to human translation), low perplexity, and high sequence accuracy. This means beam search finds sequences that match references well and the model predicts words confidently.

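Bad: Low BLEU score, high perplexity, or low sequence accuracy. This means beam search is not helping or the model is steering it toward poor sequences. For example, if BLEU drops as beam width grows, the model's scores are probably favoring the wrong sequences (for instance, overly short ones). If BLEU simply stays flat, the smaller beam is already finding the sequences the model scores highest.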

Common Metrics Pitfalls
  • Ignoring diversity: Beam search can produce very similar sequences, so metrics might look good but outputs lack variety.
  • Overfitting to references: BLEU score depends on reference sentences; good BLEU does not always mean better meaning.
  • Beam width too small or too large: Too small misses good sequences; too large wastes time and may pick worse sequences due to score biases.
  • Not considering length bias: Beam search can favor shorter or longer sequences unfairly, affecting metric scores.
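One common way to reduce length bias is to divide the summed log-probability by a function of the sequence length. The sketch below is an illustration under assumptions: `length_normalized_score` and its `alpha` parameter are names chosen for this example, and the token log-probabilities are invented.

```python
def length_normalized_score(token_log_probs, alpha=1.0):
    """Sum of token log-probabilities divided by length ** alpha.

    alpha is a hypothetical tuning knob: 0 disables normalization,
    1 gives the average log-probability per token.
    """
    return sum(token_log_probs) / (len(token_log_probs) ** alpha)

short = [-0.5, -0.6]               # 2 tokens, total log-prob -1.1
long = [-0.3, -0.3, -0.3, -0.3]    # 4 tokens, total log-prob -1.2

# Raw summed log-probability prefers the shorter sequence...
raw_prefers_short = sum(short) > sum(long)
# ...while the per-token average prefers the longer, more confident one.
normalized_prefers_long = (
    length_normalized_score(long) > length_normalized_score(short)
)
print(raw_prefers_short, normalized_prefers_long)
```

Because every extra token adds a negative term, unnormalized scores tend to favor short outputs. Whether normalization helps on a given task is something to confirm with the metrics above.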
Self Check

Your beam search model has a BLEU score of 25 but increasing beam width from 5 to 20 does not improve BLEU. Is your beam search working well? Why or why not?

Answer: A plateau by itself does not mean beam search is broken. It most likely means width 5 already finds the sequences the model scores highest, so the extra candidates at width 20 cost time without a quality gain. The limit is then the model's scoring rather than the search: keep the smaller beam, and check model scoring (for example, length bias) or try other decoding methods if you need better output.

Key Result
Beam search decoding quality is best measured by sequence-level metrics like BLEU and sequence accuracy, with perplexity as a check on the underlying model, while balancing beam width for quality and speed.

Practice

1. What is the main purpose of beam search decoding in natural language processing?
easy
A. To keep track of multiple best candidate sequences during prediction
B. To randomly select words for output generation
C. To generate only one possible output sequence
D. To speed up training by skipping steps

Solution

  1. Step 1: Understand beam search goal

    Beam search keeps multiple candidate sequences to explore more options than greedy search.
  2. Step 2: Compare options

    Only option A, "To keep track of multiple best candidate sequences during prediction", describes keeping multiple best guesses; the others describe random choice, a single output, or an unrelated training speed-up.
  3. Final Answer:

    To keep track of multiple best candidate sequences during prediction -> Option A
  4. Quick Check:

    Beam search = multiple best sequences [OK]
Hint: Beam search tracks several top guesses, not just one [OK]
Common Mistakes:
  • Confusing beam search with random sampling
  • Thinking beam search outputs only one sequence
  • Assuming beam search speeds up training
2. Which of the following is the correct way to describe the beam width in beam search decoding?
easy
A. The size of the vocabulary used for prediction
B. The number of candidate sequences kept at each decoding step
C. The length of the output sequence generated
D. The number of layers in the neural network

Solution

  1. Step 1: Define beam width

    Beam width is how many top sequences the algorithm keeps at each step to explore.
  2. Step 2: Eliminate incorrect options

    Output length, vocabulary size, and network layers are unrelated to beam width.
  3. Final Answer:

    The number of candidate sequences kept at each decoding step -> Option B
  4. Quick Check:

    Beam width = candidate count per step [OK]
Hint: Beam width = how many sequences you keep each step [OK]
Common Mistakes:
  • Mixing beam width with output length
  • Confusing beam width with vocabulary size
  • Thinking beam width relates to model architecture
3. Consider a beam search with beam width 2 decoding a sequence. At step 1, the top 2 tokens have scores [0.6, 0.4]. At step 2, each token expands to two tokens with scores: from first token [0.5, 0.3], from second token [0.7, 0.2]. Which two sequences will beam search keep after step 2?
medium
A. [First token + second expansion (0.6*0.3), Second token + second expansion (0.4*0.2)]
B. [First token + first expansion (0.6*0.5), First token + second expansion (0.6*0.3)]
C. [Second token + first expansion (0.4*0.7), Second token + second expansion (0.4*0.2)]
D. [First token + first expansion (0.6*0.5), Second token + first expansion (0.4*0.7)]

Solution

  1. Step 1: Calculate scores for all expansions

    Calculate combined scores: 0.6*0.5=0.3, 0.6*0.3=0.18, 0.4*0.7=0.28, 0.4*0.2=0.08.
  2. Step 2: Select top 2 sequences by score

    Top two scores are 0.3 and 0.28, corresponding to first token + first expansion and second token + first expansion.
  3. Final Answer:

    [First token + first expansion (0.6*0.5), Second token + first expansion (0.4*0.7)] -> Option D
  4. Quick Check:

    Top scores = 0.3 and 0.28 [OK]
Hint: Multiply scores, pick top beam width sequences [OK]
Common Mistakes:
  • Choosing expansions only from one token
  • Not multiplying scores correctly
  • Picking lower scoring sequences
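The arithmetic in this question can be checked with a short sketch. The labels "first" and "second" and the variable names are chosen for this example; the scores come from the question.

```python
# Scores from the question: step-1 tokens and their step-2 expansions.
step1 = {"first": 0.6, "second": 0.4}
step2 = {"first": [0.5, 0.3], "second": [0.7, 0.2]}
beam_width = 2

# Expand every kept token and multiply scores along each path.
expanded = [
    ((parent, child_index), parent_score * child_score)
    for parent, parent_score in step1.items()
    for child_index, child_score in enumerate(step2[parent], start=1)
]

# Rank all four expansions together, then keep the top beam_width.
expanded.sort(key=lambda c: c[1], reverse=True)
kept = expanded[:beam_width]
print(kept)
```

The ranking is done across all expansions at once, which is why one sequence from each parent token survives here instead of both coming from the higher-scoring first token.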
4. You implemented beam search decoding but notice that, for every input, it returns exactly the sequence greedy decoding would produce and never surfaces an alternative candidate. What is the most likely bug?
medium
A. The vocabulary size is too large
B. The model is not trained
C. Beam width is set to 1, making it greedy search
D. The beam search is not normalizing scores

Solution

  1. Step 1: Analyze symptom of identical outputs

    Always same output suggests no exploration of multiple sequences.
  2. Step 2: Identify beam width effect

    If beam width = 1, beam search reduces to greedy search, always picking highest scoring token only.
  3. Final Answer:

    Beam width is set to 1, making it greedy search -> Option C
  4. Quick Check:

    Beam width 1 = greedy search [OK]
Hint: Check beam width; 1 means no beam search [OK]
Common Mistakes:
  • Blaming vocabulary size for output sameness
  • Ignoring beam width setting
  • Assuming model training causes identical outputs
5. In a machine translation task, you want to balance output quality and decoding speed. You have a beam search decoder with beam width 5. What happens if you increase the beam width to 20?
hard
A. Output quality may improve but decoding will be slower
B. Output quality will decrease and decoding will be faster
C. Output quality and decoding speed remain the same
D. Decoding speed improves but output quality is unpredictable

Solution

  1. Step 1: Understand beam width effect on quality

    Larger beam width explores more sequences, often improving output quality.
  2. Step 2: Understand beam width effect on speed

    More sequences to track means more computation, slowing decoding speed.
  3. Final Answer:

    Output quality may improve but decoding will be slower -> Option A
  4. Quick Check:

    Higher beam width = often better quality, slower speed [OK]
Hint: Bigger beam = often better results but slower decoding [OK]
Common Mistakes:
  • Assuming bigger beam always speeds decoding
  • Thinking quality always decreases with bigger beam
  • Believing beam width doesn't affect speed
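A back-of-the-envelope estimate of the extra work can be sketched as follows. It assumes, as a simplification, that every decoding step holds a full beam and scores each kept sequence against the whole vocabulary; the function name and the vocabulary size and step count are made up for illustration.

```python
def candidates_scored(beam_width, vocab_size, steps):
    """Rough count of candidate extensions scored during decoding.

    Assumes a full beam at every step and a full-vocabulary expansion
    of each kept sequence (a simplification).
    """
    return beam_width * vocab_size * steps

small = candidates_scored(beam_width=5, vocab_size=32000, steps=50)
large = candidates_scored(beam_width=20, vocab_size=32000, steps=50)
print(large / small)
```

Under these assumptions the candidate count grows linearly with beam width, so going from 5 to 20 means about four times as many candidates to score. Actual wall-clock slowdown depends on batching and hardware.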