Beam search decoding helps find the best possible output sequence from a model by exploring multiple options at once, instead of just one. It balances between trying many possibilities and keeping the search manageable.
Beam search decoding in NLP
Start learning this pattern below
Jump into concepts and practice - no test required
import math def beam_search_decoder(predictions, beam_width): sequences = [[[], 0.0]] for row in predictions: all_candidates = [] for seq, score in sequences: for j, prob in enumerate(row): if prob > 0: candidate = [seq + [j], score - math.log(prob)] all_candidates.append(candidate) ordered = sorted(all_candidates, key=lambda tup: tup[1]) sequences = ordered[:beam_width] return sequences
The predictions input is usually a list of probability distributions for each step in the sequence.
beam_width controls how many sequences to keep at each step; bigger means better results but slower.
beam_search_decoder([[0.1, 0.9], [0.8, 0.2]], beam_width=2)
beam_search_decoder([[0.3, 0.7, 0.0], [0.4, 0.4, 0.2]], beam_width=1)
This program runs beam search decoding on a small example with 3 steps and 3 tokens each. It prints the top 2 sequences found and their scores.
import math def beam_search_decoder(predictions, beam_width): sequences = [[[], 0.0]] # sequences and their scores (negative log probabilities) for row in predictions: all_candidates = [] for seq, score in sequences: for j, prob in enumerate(row): if prob > 0: candidate = [seq + [j], score - math.log(prob)] all_candidates.append(candidate) sequences = sorted(all_candidates, key=lambda tup: tup[1])[:beam_width] return sequences # Example predictions: 3 steps, 3 possible tokens each predictions = [ [0.1, 0.6, 0.3], [0.3, 0.5, 0.2], [0.4, 0.4, 0.2] ] beam_width = 2 results = beam_search_decoder(predictions, beam_width) print("Top sequences and their scores:") for seq, score in results: print(f"Sequence: {seq}, Score: {score:.4f}")
Beam search is not guaranteed to find the absolute best sequence but usually finds a good one faster than trying all possibilities.
Using log probabilities helps avoid very small numbers and makes multiplication into addition.
Choosing beam width is a trade-off: bigger means better results but slower computation.
Beam search keeps multiple best guesses at each step to find good output sequences.
It balances quality and speed by limiting how many sequences it tracks.
It is widely used in language tasks like translation and text generation.
Practice
Solution
Step 1: Understand beam search goal
Beam search keeps multiple candidate sequences to explore more options than greedy search.Step 2: Compare options
Only To keep track of multiple best candidate sequences during prediction describes keeping multiple best guesses; others describe random choice, single output, or unrelated speed-up.Final Answer:
To keep track of multiple best candidate sequences during prediction -> Option AQuick Check:
Beam search = multiple best sequences [OK]
- Confusing beam search with random sampling
- Thinking beam search outputs only one sequence
- Assuming beam search speeds up training
Solution
Step 1: Define beam width
Beam width is how many top sequences the algorithm keeps at each step to explore.Step 2: Eliminate incorrect options
Output length, vocabulary size, and network layers are unrelated to beam width.Final Answer:
The number of candidate sequences kept at each decoding step -> Option BQuick Check:
Beam width = candidate count per step [OK]
- Mixing beam width with output length
- Confusing beam width with vocabulary size
- Thinking beam width relates to model architecture
Solution
Step 1: Calculate scores for all expansions
Calculate combined scores: 0.6*0.5=0.3, 0.6*0.3=0.18, 0.4*0.7=0.28, 0.4*0.2=0.08.Step 2: Select top 2 sequences by score
Top two scores are 0.3 and 0.28, corresponding to first token + first expansion and second token + first expansion.Final Answer:
[First token + first expansion (0.6*0.5), Second token + first expansion (0.4*0.7)] -> Option DQuick Check:
Top scores = 0.3 and 0.28 [OK]
- Choosing expansions only from one token
- Not multiplying scores correctly
- Picking lower scoring sequences
Solution
Step 1: Analyze symptom of identical outputs
Always same output suggests no exploration of multiple sequences.Step 2: Identify beam width effect
If beam width = 1, beam search reduces to greedy search, always picking highest scoring token only.Final Answer:
Beam width is set to 1, making it greedy search -> Option CQuick Check:
Beam width 1 = greedy search [OK]
- Blaming vocabulary size for output sameness
- Ignoring beam width setting
- Assuming model training causes identical outputs
Solution
Step 1: Understand beam width effect on quality
Larger beam width explores more sequences, often improving output quality.Step 2: Understand beam width effect on speed
More sequences to track means more computation, slowing decoding speed.Final Answer:
Output quality may improve but decoding will be slower -> Option AQuick Check:
Higher beam width = better quality, slower speed [OK]
- Assuming bigger beam always speeds decoding
- Thinking quality decreases with bigger beam
- Believing beam width doesn't affect speed
