Beam search is used in tasks like machine translation and text generation, where the goal is to find the best sequence of words. The metrics that matter most are therefore those that measure the quality of the generated sequences: BLEU score (how closely the output overlaps with a reference), perplexity (how well the model predicts the next word), and sequence accuracy (exact match rate). These metrics tell us whether beam search finds better sentences than simpler decoding methods.
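Two of these metrics are simple enough to sketch by hand. The snippet below is a minimal illustration, not a standard library API: the function names and sample strings are made up for this example, and perplexity is computed from per-token probabilities that are assumed to come from the model. BLEU is left out because it is normally computed with a dedicated library.

```python
import math


def sequence_accuracy(predictions, references):
    """Fraction of predictions that match their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(predictions)


def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical outputs and references, for illustration only.
preds = ["i am here", "it is late"]
refs = ["i am here", "it was late"]
print(sequence_accuracy(preds, refs))  # one of two sequences matches exactly

# Hypothetical per-token probabilities the model assigned to one sequence.
print(perplexity([0.5, 0.5, 0.5]))  # close to 2: like choosing between 2 words
```

Lower perplexity means the model assigned higher probability to the tokens it was scored on; sequence accuracy is strict and gives no credit for near misses.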
Beam search decoding in NLP - Model Metrics & Evaluation
Unlike classification, beam search decoding does not produce a confusion matrix. Instead, we can visualize the beam paths and their scores. For example, with a beam width of 3, the top 3 partial sequences are kept at each step, and each score shows how likely that sequence is. A simple ASCII example covering two steps:
Step 1: ["I" (score=0.9), "It" (0.7), "In" (0.5)]
Step 2:
"I am" (0.8), "I is" (0.6), "I was" (0.5)
"It is" (0.7), "It was" (0.4), "It can" (0.3)
"In the" (0.5), "In a" (0.4), "In my" (0.3)
Only the top 3 of these nine candidates are kept: "I am" (0.8), "It is" (0.7), and "I is" (0.6).
This shows how beam search narrows down the best sequences step by step.
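The pruning in Step 2 can be reproduced in a few lines. This is a minimal sketch that reuses the scores from the example; the `prune` helper is a name invented here, not part of any library.

```python
import heapq


def prune(candidates, beam_width):
    """Keep the beam_width highest-scoring (sequence, score) pairs."""
    return heapq.nlargest(beam_width, candidates, key=lambda pair: pair[1])


# The nine Step 2 candidates from the example above.
step2 = [
    ("I am", 0.8), ("I is", 0.6), ("I was", 0.5),
    ("It is", 0.7), ("It was", 0.4), ("It can", 0.3),
    ("In the", 0.5), ("In a", 0.4), ("In my", 0.3),
]

kept = prune(step2, beam_width=3)
print(kept)  # [('I am', 0.8), ('It is', 0.7), ('I is', 0.6)]
```

Note that every sequence starting with "In" is dropped at this step, so no later step can recover it.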
Beam search balances exploring many possible sequences (high recall of options) against focusing on the best ones (high precision). A small beam width keeps fewer sequences, so it is fast but might miss the best sentence (low recall). A large beam width keeps many sequences, which increases the chance of finding the best output (high recall), but it is slower and may keep many poor sequences (lower precision).
Example: In machine translation, a beam width of 1 (greedy search) might miss a better translation. A beam width of 10 finds better sentences but takes more time. Choosing beam width is a tradeoff between quality and speed.
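The tradeoff can be seen on a toy problem. The sketch below is an illustration under stated assumptions: `TOY_MODEL` is a made-up table of next-token probabilities (not a real model), and `beam_search` is a simplified implementation that sums log-probabilities and stops after a fixed number of steps.

```python
import math

# Hypothetical toy model: maps a prefix (tuple of tokens) to next-token probabilities.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def beam_search(model, beam_width, steps):
    """Return the surviving (sequence, log-probability) pairs, best first."""
    beams = [((), 0.0)]  # start with the empty prefix
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model.get(prefix, {}).items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        if not candidates:
            break
        # Keep only the beam_width best candidates for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


greedy = beam_search(TOY_MODEL, beam_width=1, steps=2)[0]
wider = beam_search(TOY_MODEL, beam_width=2, steps=2)[0]
print(greedy[0], math.exp(greedy[1]))  # ('a', 'x'), probability about 0.33
print(wider[0], math.exp(wider[1]))    # ('b', 'x'), probability about 0.36
```

With width 1 the search commits to "a" because it looks best at the first step, and it never sees that "b" followed by "x" has a higher overall probability. Width 2 keeps "b" alive long enough to find it, at the cost of scoring twice as many candidates.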
Good: High BLEU score (close to human translation), low perplexity, and high sequence accuracy. This means beam search finds sequences that match references well and the model predicts words confidently.
Bad: Low BLEU score, high perplexity, or low sequence accuracy. This means beam search is not helping or is stuck in poor sequences. For example, if increasing beam width does not improve BLEU, the search might be ineffective.
- Ignoring diversity: Beam search can produce very similar sequences, so metrics might look good but outputs lack variety.
- Overfitting to references: BLEU score depends on reference sentences; good BLEU does not always mean better meaning.
- Beam width too small or too large: Too small misses good sequences; too large wastes time and may pick worse sequences due to score biases.
- Not considering length bias: Beam search can favor shorter or longer sequences unfairly, affecting metric scores.
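The length bias in the last point comes from how scores accumulate: each extra token adds another negative log-probability, so longer sequences tend to score lower. One common heuristic, shown here as a sketch rather than the only option, divides the total by the sequence length raised to a power. The function name, the `alpha` parameter, and the candidate numbers are assumptions made for this example.

```python
def length_normalized_score(log_prob_sum, length, alpha=1.0):
    """Divide the total log-probability by length**alpha (alpha=0 disables it)."""
    if length <= 0:
        raise ValueError("length must be positive")
    return log_prob_sum / (length ** alpha)


# Hypothetical candidates: (total log-probability, number of tokens).
short_candidate = (-2.0, 2)
long_candidate = (-3.0, 5)
candidates = [short_candidate, long_candidate]

raw_winner = max(candidates, key=lambda c: c[0])
norm_winner = max(candidates, key=lambda c: length_normalized_score(c[0], c[1]))

print(raw_winner)   # (-2.0, 2): the raw score prefers the shorter sequence
print(norm_winner)  # (-3.0, 5): per-token score is -0.6 versus -1.0
```

Which candidate is actually better depends on the task, so the normalization strength is usually tuned against the evaluation metric.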
Your beam search model has a BLEU score of 25 but increasing beam width from 5 to 20 does not improve BLEU. Is your beam search working well? Why or why not?
Answer: Beam search is not gaining anything beyond width 5. Either width 5 already finds the sequences the model scores highest, or the model's scores do not help it find better sequences, so the search stays stuck on similar outputs. In both cases, increasing the beam width costs time without improving quality. You may need to check the model's scoring or try other decoding methods.
Practice
Solution
Step 1: Understand beam search goal
Beam search keeps multiple candidate sequences to explore more options than greedy search.
Step 2: Compare options
Only "To keep track of multiple best candidate sequences during prediction" describes keeping multiple best guesses; the others describe random choice, a single output, or an unrelated speed-up.
Final Answer:
To keep track of multiple best candidate sequences during prediction -> Option A
Quick Check:
Beam search = multiple best sequences [OK]
- Confusing beam search with random sampling
- Thinking beam search outputs only one sequence
- Assuming beam search speeds up training
Solution
Step 1: Define beam width
Beam width is how many top sequences the algorithm keeps at each step to explore.
Step 2: Eliminate incorrect options
Output length, vocabulary size, and network layers are unrelated to beam width.
Final Answer:
The number of candidate sequences kept at each decoding step -> Option B
Quick Check:
Beam width = candidate count per step [OK]
- Mixing beam width with output length
- Confusing beam width with vocabulary size
- Thinking beam width relates to model architecture
Solution
Step 1: Calculate scores for all expansions
Calculate combined scores: 0.6*0.5=0.3, 0.6*0.3=0.18, 0.4*0.7=0.28, 0.4*0.2=0.08.
Step 2: Select top 2 sequences by score
Top two scores are 0.3 and 0.28, corresponding to first token + first expansion and second token + first expansion.
Final Answer:
[First token + first expansion (0.6*0.5), Second token + first expansion (0.4*0.7)] -> Option D
Quick Check:
Top scores = 0.3 and 0.28 [OK]
- Choosing expansions only from one token
- Not multiplying scores correctly
- Picking lower scoring sequences
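The arithmetic in the solution above (token scores 0.6 and 0.4, beam width 2) can be checked with a short script. The labels such as "first expansion" are placeholder names used only for this sketch, since the actual tokens are not given.

```python
import heapq

# Scores of the two tokens kept after the first step.
first_tokens = {"first token": 0.6, "second token": 0.4}

# Scores of each token's possible expansions (placeholder labels).
expansions = {
    "first token": {"first expansion": 0.5, "second expansion": 0.3},
    "second token": {"first expansion": 0.7, "second expansion": 0.2},
}

# Combined score = token score * expansion score, over ALL four combinations.
scored = [
    ((token, expansion), p_token * p_expansion)
    for token, p_token in first_tokens.items()
    for expansion, p_expansion in expansions[token].items()
]

# Beam width 2: keep the two best combinations regardless of parent token.
top2 = heapq.nlargest(2, scored, key=lambda item: item[1])
for sequence, score in top2:
    print(sequence, round(score, 2))
```

Ranking all four combinations together, rather than picking the best expansion per token or only expanding the best token, is what avoids the first mistake in the list.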
Solution
Step 1: Analyze symptom of identical outputs
Always getting the same output suggests that multiple sequences are not being explored.
Step 2: Identify beam width effect
If beam width = 1, beam search reduces to greedy search, which always picks only the highest-scoring token.
Final Answer:
Beam width is set to 1, making it greedy search -> Option C
Quick Check:
Beam width 1 = greedy search [OK]
- Blaming vocabulary size for output sameness
- Ignoring beam width setting
- Assuming model training causes identical outputs
Solution
Step 1: Understand beam width effect on quality
Larger beam width explores more sequences, often improving output quality.
Step 2: Understand beam width effect on speed
More sequences to track means more computation, slowing decoding speed.
Final Answer:
Output quality may improve but decoding will be slower -> Option A
Quick Check:
Higher beam width = better quality, slower speed [OK]
- Assuming bigger beam always speeds decoding
- Thinking quality decreases with bigger beam
- Believing beam width doesn't affect speed
