The Transformer model uses a self-attention mechanism. What does this mechanism mainly do?
Think about how the model understands relationships between words in a sentence.
Self-attention allows the model to weigh the importance of each word relative to others, helping it understand context and meaning.
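The weighting step can be sketched with scaled dot-product attention, the mechanism used in the original Transformer. This is a minimal illustration; the tensor sizes and the identity projections for queries, keys, and values are simplifications for clarity:

```python
import torch
import torch.nn.functional as F

# Toy embeddings for a 4-word sentence: (seq_len=4, d=8)
torch.manual_seed(0)
x = torch.rand(4, 8)

# In self-attention, queries, keys, and values all come from the same input
# (identity projections here for simplicity; real models use learned ones).
q, k, v = x, x, x

# Each word's query is scored against every word's key, scaled by sqrt(d).
scores = q @ k.T / (k.shape[-1] ** 0.5)   # (4, 4) pairwise scores
weights = F.softmax(scores, dim=-1)       # each row sums to 1
context = weights @ v                     # (4, 8) context-aware vectors

print(weights.shape, context.shape)       # torch.Size([4, 4]) torch.Size([4, 8])
```

Row i of `weights` tells you how much word i attends to every other word, which is exactly the "importance relative to others" described above.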
Given an input tensor of shape (batch_size=2, seq_len=5, embedding_dim=64) passed through a multi-head attention layer with 8 heads and output dimension 64, what is the shape of the output tensor?
import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
embedding_dim = 64
num_heads = 8

x = torch.rand(batch_size, seq_len, embedding_dim)
mha = nn.MultiheadAttention(embed_dim=embedding_dim, num_heads=num_heads, batch_first=True)
output, _ = mha(x, x, x)
output.shape  # torch.Size([2, 5, 64])
Remember the output shape matches the input sequence length and embedding dimension.
The multi-head attention layer outputs a tensor with the same batch size, sequence length, and embedding dimension as the input.
Why might increasing the number of attention heads in a Transformer model improve performance?
Think about how multiple heads help the model see different aspects of the input.
Multiple attention heads let the model focus on different parts or features of the input simultaneously, improving learning capacity.
During training of a Transformer model, the training loss decreases steadily but the validation loss starts increasing after some epochs. What does this indicate?
Think about what it means when validation loss worsens but training loss improves.
When validation loss increases while training loss keeps decreasing, the model is fitting the training data too closely and failing to generalize to unseen data; this is the classic sign of overfitting.
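A common response is early stopping: monitor validation loss each epoch and stop (keeping the best checkpoint) once it stops improving. A minimal sketch; the loss values below are simulated for illustration, not real training output:

```python
# Early stopping: halt when validation loss has not improved for
# `patience` consecutive epochs. Simulated losses that start rising,
# mirroring the overfitting pattern described in the answer above.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]

patience = 2
best_loss = float("inf")
best_epoch = 0
epochs_without_improvement = 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch   # would also save a checkpoint here
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping at epoch {epoch}; best was epoch {best_epoch} (loss {best_loss})")
            break
```

In a real training loop, the checkpoint saved at `best_epoch` is the model you keep, since later epochs only memorize the training set.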
While training a Transformer model, the loss suddenly becomes NaN after a few epochs. Which of the following is the most likely cause?
Consider what can cause gradients or loss to become infinite or undefined.
A very high learning rate can cause gradients to explode, leading to NaN loss values during training.
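Two common safeguards are lowering the learning rate and clipping the gradient norm before each optimizer step. A sketch using PyTorch's built-in clipping utility; the linear model and random input are toy placeholders, not a real Transformer:

```python
import torch
import torch.nn as nn

# Toy stand-in for a real model and batch.
model = nn.Linear(64, 64)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.rand(2, 64)

loss = model(x).pow(2).mean()
loss.backward()

# Cap the total gradient norm at 1.0 so one bad batch cannot blow up
# the update and drive the loss to NaN.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()

# After clipping, the overall gradient norm is at most max_norm.
total_norm = sum(p.grad.norm() ** 2 for p in model.parameters()) ** 0.5
print(float(total_norm) <= 1.0 + 1e-6)  # True
```

Clipping is applied between `backward()` and `step()`; combined with a smaller learning rate (or warmup), it usually eliminates NaN losses caused by exploding gradients.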