In the Transformer model, the self-attention mechanism helps the model to:
Think about how the model learns connections between words regardless of their position.
Self-attention allows the Transformer to weigh the importance of each word in the input sequence relative to others, capturing context effectively.
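The weighting described above can be sketched as scaled dot-product attention, the core operation inside Transformer self-attention. This is a minimal illustrative implementation (the function name and toy shapes are not from the original snippet):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v                             # (batch, seq, d_k)

x = torch.rand(2, 5, 16)  # (batch, seq_len, embed_dim)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([2, 5, 16])
```

Each output position is a weighted mix of every value vector, which is how a word can attend to any other word regardless of distance.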
Given the following code snippet using PyTorch, what is the shape of the output tensor?
```python
import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
embed_dim = 16
num_heads = 4

x = torch.rand(batch_size, seq_len, embed_dim)
mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
# PyTorch MultiheadAttention expects input shape (seq_len, batch_size, embed_dim)
x_t = x.transpose(0, 1)
out, _ = mha(x_t, x_t, x_t)
output_shape = out.shape
```
Check the input and output shapes expected by PyTorch's MultiheadAttention.
By default (batch_first=False), PyTorch's MultiheadAttention expects input of shape (sequence length, batch size, embedding dimension) and returns output of the same shape, so here output_shape is torch.Size([5, 2, 16]).
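As a sanity check, PyTorch 1.9 and later also accept batch_first=True, which lets the module take batch-first input directly and skips the transpose. A minimal sketch with the same toy shapes as the snippet above:

```python
import torch
import torch.nn as nn

# With batch_first=True, inputs and outputs are (batch, seq_len, embed_dim).
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.rand(2, 5, 16)  # (batch, seq_len, embed_dim), no transpose needed
out, attn_weights = mha(x, x, x)
print(out.shape)  # torch.Size([2, 5, 16])
```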
Which of the following is a valid reason to increase the number of attention heads in a Transformer model?
Think about how multiple heads help the model understand different aspects of the input.
Multiple attention heads allow the model to jointly attend to information from different representation subspaces, improving learning capacity.
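One way to see those subspaces: each head operates on an embed_dim // num_heads slice of the embedding, which is also why nn.MultiheadAttention requires embed_dim to be divisible by num_heads. A minimal sketch of the per-head split (the shapes are illustrative):

```python
import torch

batch, seq_len, embed_dim, num_heads = 2, 5, 16, 4
head_dim = embed_dim // num_heads  # 4: each head sees a smaller subspace

x = torch.rand(batch, seq_len, embed_dim)
# Reshape the embedding into per-head subspaces: (batch, num_heads, seq_len, head_dim)
heads = x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 4, 5, 4])
```

Each head then runs attention independently on its head_dim-sized slice, and the per-head outputs are concatenated back to embed_dim.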
During training a Transformer for language modeling, the loss decreases steadily but the validation loss starts increasing after some epochs. What does this indicate?
Consider what it means when training loss improves but validation loss worsens.
When validation loss increases while training loss decreases, the model is memorizing training data but failing to generalize, a sign of overfitting.
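A common response to this pattern is early stopping: halt training once validation loss has stopped improving for a few epochs and keep the earlier checkpoint. This is a minimal sketch with illustrative loss values (the helper name is hypothetical):

```python
# Stop when validation loss has not improved for `patience` consecutive epochs.
def early_stop_epoch(val_losses, patience=2):
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch  # stop here; an earlier checkpoint generalized better
    return None  # patience never exhausted

print(early_stop_epoch([2.1, 1.7, 1.5, 1.6, 1.8, 2.0]))  # 4
```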
Consider this simplified code snippet for positional encoding in a Transformer. What error will this code raise when run?
```python
import torch
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pos_enc = positional_encoding(10, 7)
```
Check the shapes of slices pe[:, 0::2] and pe[:, 1::2] when d_model is odd.
When d_model is odd (here 7), pe[:, 1::2] has only d_model // 2 = 3 columns, while torch.cos(position * div_term) has 4, so the assignment raises a RuntimeError due to the shape mismatch. (The sine assignment is fine: pe[:, 0::2] has 4 columns, matching div_term.)
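One possible fix is to trim the cosine term so it matches the narrower odd-indexed slice; this slicing choice is just one of several valid repairs:

```python
import torch
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    # pe[:, 1::2] has d_model // 2 columns, one fewer than div_term when
    # d_model is odd, so trim div_term to match before assigning.
    pe[:, 1::2] = torch.cos(position * div_term[: d_model // 2])
    return pe

print(positional_encoding(10, 7).shape)  # torch.Size([10, 7])
```

For even d_model the slice `div_term[: d_model // 2]` is the whole tensor, so the function behaves identically to the standard formulation.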