In an encoder-decoder model for sequence-to-sequence tasks, what does the attention mechanism primarily help with?
Think about how the decoder decides which parts of the input to use when producing each output word.
The attention mechanism lets the decoder attend to different parts of the input sequence at each decoding step, instead of compressing the entire input into a single fixed-length context vector. This markedly improves translation quality, especially for long sentences, and benefits other sequence-to-sequence tasks.
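The per-step reweighting described above can be sketched in a few lines of PyTorch. This is a minimal dot-product attention example with made-up sizes (one decoder step, four encoder positions); the variable names are illustrative, not from any particular library:

```python
import torch

# Hypothetical sizes for illustration
batch, src_len, hidden = 1, 4, 8
encoder_outputs = torch.rand(batch, src_len, hidden)  # one vector per input token
decoder_state = torch.rand(batch, 1, hidden)          # current decoder step

# Score each encoder position against the decoder state
scores = torch.bmm(decoder_state, encoder_outputs.transpose(1, 2))  # (1, 1, 4)
weights = torch.softmax(scores, dim=-1)  # normalized over input positions
context = torch.bmm(weights, encoder_outputs)  # weighted sum: (1, 1, 8)

print(weights.shape, context.shape)
```

The `context` vector is recomputed at every decoding step, which is exactly what replaces the single fixed vector of a plain encoder-decoder model.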
Given the following PyTorch code snippet for scaled dot-product attention weights calculation, what is the shape of attention_weights?
```python
import torch

batch_size = 2
seq_len_enc = 5
seq_len_dec = 3
hidden_dim = 4

encoder_outputs = torch.rand(batch_size, seq_len_enc, hidden_dim)
decoder_hidden = torch.rand(batch_size, seq_len_dec, hidden_dim)

# Compute attention scores
scores = torch.bmm(decoder_hidden, encoder_outputs.transpose(1, 2)) / (hidden_dim ** 0.5)

# Apply softmax to get attention weights
attention_weights = torch.softmax(scores, dim=2)
print(attention_weights.shape)
```
Recall that torch.bmm batch-multiplies matrices of shape (batch, n, m) and (batch, m, p) resulting in (batch, n, p).
torch.bmm multiplies (2, 3, 4) by (2, 4, 5), so scores has shape (batch_size, seq_len_dec, seq_len_enc) = (2, 3, 5). Softmax along dim=2 normalizes over the encoder length without changing the shape, so attention_weights is (2, 3, 5).
You want to build an encoder-decoder model for translating very long sentences. Which attention mechanism is best to handle long input sequences efficiently?
Consider the computational cost and relevance of distant input tokens for very long sequences.
Local attention reduces computation by focusing on a small relevant window, making it more efficient for long inputs while still capturing important context.
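The windowing idea can be sketched by masking scores outside a small neighborhood before the softmax. This is a simplified illustration, not a full local-attention implementation: the window center is hard-coded here, whereas real local attention predicts or aligns it per decoder step, and all sizes are made up:

```python
import torch

batch, src_len, hidden, window = 1, 10, 8, 2  # attend only +/- 2 positions
encoder_outputs = torch.rand(batch, src_len, hidden)
decoder_state = torch.rand(batch, 1, hidden)
center = 5  # assumed aligned source position for this decoder step

scores = torch.bmm(decoder_state, encoder_outputs.transpose(1, 2))  # (1, 1, 10)

# Mask out every position farther than `window` from the center
positions = torch.arange(src_len)
outside = (positions - center).abs() > window  # broadcasts over (1, 1, 10)
scores = scores.masked_fill(outside, float('-inf'))

weights = torch.softmax(scores, dim=-1)  # zero outside the window
```

Because only 2 * window + 1 positions receive nonzero weight, the cost per decoder step no longer grows with the full input length.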
In a transformer encoder-decoder model, what is the effect of increasing the number of attention heads in multi-head attention?
Think about why multiple attention heads might help the model understand different aspects of the input.
Multiple heads let the model focus on different parts or features of the input simultaneously, improving learning and representation.
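One point worth making concrete: in standard multi-head attention, adding heads splits the embedding into smaller per-head subspaces rather than adding parameters. A quick check with PyTorch's nn.MultiheadAttention (dimensions chosen arbitrarily for illustration):

```python
import torch
import torch.nn as nn

embed_dim = 64
one_head = nn.MultiheadAttention(embed_dim, num_heads=1, batch_first=True)
four_heads = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

# Each of the 4 heads attends over a 64/4 = 16-dim subspace, so the
# total parameter count is identical to the single-head module.
p1 = sum(p.numel() for p in one_head.parameters())
p4 = sum(p.numel() for p in four_heads.parameters())
print(p1 == p4)  # True

x = torch.rand(2, 5, embed_dim)
out, attn = four_heads(x, x, x)  # self-attention over a toy batch
print(out.shape)  # torch.Size([2, 5, 64])
```

So the benefit of more heads is representational (several independent attention patterns per layer), not extra capacity in the raw parameter count, though embed_dim must be divisible by num_heads.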
During training of an encoder-decoder model with attention, the loss suddenly becomes NaN after a few epochs. Which of the following is the most likely cause?
Consider what can cause softmax to produce invalid values and how attention scores are computed.
If attention scores grow very large (for example, when dot products are not scaled), the softmax inputs can overflow and gradients can explode, eventually driving the loss to NaN. Scaling the scores by the square root of the hidden size, as in scaled dot-product attention, keeps their variance near 1 and prevents this.
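The effect of scaling is easy to verify numerically: the dot product of two independent d-dimensional standard-normal vectors has variance d, so dividing by sqrt(d) brings the score variance back to about 1. A small sketch (sample count and dimension chosen arbitrarily):

```python
import torch

torch.manual_seed(0)
d = 512
q = torch.randn(10000, d)
k = torch.randn(10000, d)

raw = (q * k).sum(dim=1)  # unscaled dot products, variance ~ d
scaled = raw / d ** 0.5   # scaled dot products, variance ~ 1

print(raw.std().item())     # roughly sqrt(512) ~ 22.6
print(scaled.std().item())  # roughly 1.0
```

Without the scaling, scores with magnitude in the tens push softmax into near-one-hot saturation, and in deeper models the resulting large activations and gradients are a common route to NaN losses.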