Jump into concepts and practice - no test required
or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Recall & Review
beginner
What is the main purpose of the encoder in an encoder-decoder model?
The encoder reads the input data and converts it into a fixed-size representation (a context vector) that summarizes the input information for the decoder to use.
Click to reveal answer
beginner
Why do we use attention in encoder-decoder models?
Attention helps the decoder focus on different parts of the input sequence at each step, instead of relying on a single fixed context vector. This improves performance, especially for long sequences.
Click to reveal answer
intermediate
Describe how the attention mechanism works in simple terms.
At each step, the decoder looks at all encoder outputs and assigns weights (attention scores) to them. These weights show how important each input part is for generating the current output word.
Click to reveal answer
intermediate
What is the difference between the context vector in a basic encoder-decoder and one with attention?
In a basic model, the context vector is fixed and the same for all output steps. With attention, the context vector changes at each step, computed as a weighted sum of encoder outputs based on attention scores.
Click to reveal answer
intermediate
How does attention improve translation quality in machine translation tasks?
Attention allows the model to align output words with relevant input words dynamically, helping it handle long sentences and complex structures better than fixed context models.
Click to reveal answer
What does the encoder output in an encoder-decoder model with attention?
AThe final output sentence
BA single word prediction
CA sequence of hidden states representing the input
DThe loss value
✗ Incorrect
The encoder outputs a sequence of hidden states that represent the input sequence, which the attention mechanism uses.
In attention, what do the attention weights represent?
AThe length of the input sequence
BThe importance of each input token for the current output token
CThe number of layers in the model
DThe learning rate
✗ Incorrect
Attention weights show how much each input token should influence the current output token.
Why is a fixed context vector limiting in basic encoder-decoder models?
AIt cannot capture all input details for long sequences
BIt increases training speed
CIt reduces model size
DIt improves output diversity
✗ Incorrect
A fixed context vector compresses all input information into one vector, which can lose details especially for long inputs.
Which part of the model uses attention scores to generate output?
ADecoder
BEncoder
CInput layer
DLoss function
✗ Incorrect
The decoder uses attention scores to focus on relevant encoder outputs when producing each output token.
What is a common benefit of adding attention to encoder-decoder models?
ALess training data needed
BFaster inference without accuracy change
CSimpler model architecture
DBetter handling of long input sequences
✗ Incorrect
Attention helps models better process long sequences by focusing on important parts dynamically.
Explain how the attention mechanism changes the way the decoder generates each output token compared to a basic encoder-decoder model.
Think about how the decoder decides what input information to use at each step.
You got /4 concepts.
Describe the roles of the encoder, decoder, and attention mechanism in an encoder-decoder model with attention.
Consider how these components work together to produce output.
You got /3 concepts.
Practice
(1/5)
1. What is the main purpose of the attention mechanism in an encoder-decoder model?
easy
A. To randomly select input tokens for the decoder
B. To help the model focus on relevant parts of the input sequence when generating each output token
C. To speed up the training by skipping some input tokens
D. To reduce the size of the input data before encoding
Solution
Step 1: Understand the role of attention in sequence models
Attention helps the decoder look at specific parts of the input sequence instead of the whole input equally.
Step 2: Identify the correct purpose
The attention mechanism focuses on relevant input parts to improve output quality.
Final Answer:
To help the model focus on relevant parts of the input sequence when generating each output token -> Option B
Quick Check:
Attention = Focus on input parts [OK]
Hint: Attention means focusing on important input parts [OK]
Common Mistakes:
Thinking attention reduces input size
Believing attention speeds training by skipping tokens
Assuming attention randomly selects tokens
2. Which of the following is the correct way to compute the attention weights in an encoder-decoder model?
easy
A. Apply softmax to the dot product of decoder hidden state and encoder outputs
B. Add encoder outputs and decoder outputs directly without normalization
C. Multiply decoder output by a random matrix
D. Use the maximum value of encoder outputs as attention weight
Solution
Step 1: Recall attention weight calculation
Attention weights are usually computed by taking the dot product between the decoder's current hidden state and each encoder output, then applying softmax to get probabilities.
Step 2: Match the correct formula
Apply softmax to the dot product of decoder hidden state and encoder outputs correctly describes this process with softmax on dot product.
Final Answer:
Apply softmax to the dot product of decoder hidden state and encoder outputs -> Option A
Quick Check:
Attention weights = softmax(dot product) [OK]
Hint: Attention weights come from softmax of dot products [OK]
Common Mistakes:
Skipping softmax normalization
Adding outputs without weighting
Using random matrices instead of encoder states
3. Given the following simplified code snippet for attention weights calculation, what will be the output shape of attention_weights?
4. You implemented an encoder-decoder with attention model but notice the attention weights are always uniform (equal values). What is the most likely cause?
medium
A. The batch size is too small
B. The encoder outputs have different dimensions than decoder hidden states
C. The model uses too many layers in the encoder
D. The softmax function is missing after computing attention scores
Solution
Step 1: Understand uniform attention weights meaning
If attention weights are uniform, the model treats all input tokens equally without focusing on any part.
Step 2: Identify missing softmax effect
Without softmax, raw scores are not normalized into probabilities, causing uniform or incorrect weights.
Final Answer:
The softmax function is missing after computing attention scores -> Option D
Quick Check:
Missing softmax = uniform attention weights [OK]
Hint: Always apply softmax to attention scores [OK]
Common Mistakes:
Ignoring normalization step
Blaming encoder size or batch size
Assuming model depth causes uniform weights
5. In a machine translation task using an encoder-decoder with attention, the model struggles to translate long sentences accurately. Which modification can best help improve performance?
hard
A. Remove the attention mechanism to simplify the model
B. Reduce the encoder hidden size to speed up training
C. Use multi-head attention to capture different aspects of the input simultaneously
D. Increase the batch size without changing the model
Solution
Step 1: Identify challenges with long sentences
Long sentences require the model to focus on multiple relevant parts; single attention may miss some details.
Step 2: Understand multi-head attention benefits
Multi-head attention allows the model to attend to different parts of the input in parallel, improving context understanding.
Final Answer:
Use multi-head attention to capture different aspects of the input simultaneously -> Option C
Quick Check:
Multi-head attention = better long sentence handling [OK]
Hint: Multi-head attention improves focus on complex inputs [OK]
Common Mistakes:
Thinking smaller hidden size helps accuracy
Removing attention reduces model power
Assuming batch size alone fixes long sentence issues