Which statement best describes the role of the attention mechanism in a Transformer model?
Think about how the model decides which words to pay attention to when translating a sentence.
The attention mechanism helps the Transformer model weigh the importance of different input tokens dynamically for each output token, enabling better context understanding.
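To make the answer concrete, here is a minimal sketch of scaled dot-product attention in NumPy; the shapes and values are illustrative only, not tied to any particular model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Score every query against every key, scaled by sqrt(d_k)
    # to keep the softmax inputs in a stable range.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns scores into per-token weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mix of all value vectors —
    # this is the "dynamic weighting" of input tokens.
    return weights @ V, weights

# Toy example: 3 tokens with embedding dim 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape)        # (3, 4)
print(w.sum(axis=-1))   # each row of attention weights sums to 1
```

Each row of `w` shows how strongly one token attends to every other token, which is exactly the per-output-token weighting the answer describes.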
Given an input tensor of shape (batch_size=4, sequence_length=10, embedding_dim=64) passed through a Transformer encoder layer with the same embedding dimension, what will be the shape of the output tensor?
input_shape = (4, 10, 64)
# Transformer encoder layer with embedding_dim=64
output_shape = (4, 10, 64)
The Transformer encoder preserves the sequence length and embedding dimension in its output.
The Transformer encoder layer outputs a tensor with the same batch size, sequence length, and embedding dimension as the input.
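A quick way to see why the shape is preserved is to trace a minimal encoder layer by hand. The sketch below uses identity attention projections and a small feed-forward block in NumPy (a simplification of a real encoder layer, with assumed hidden size 256) purely to verify that every step maps (4, 10, 64) back to (4, 10, 64).

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model = 4, 10, 64

def encoder_layer(x, W_ff1, W_ff2):
    # Self-attention scores: (batch, seq_len, seq_len)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d_model)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    x = x + weights @ x                        # residual around attention
    x = x + np.maximum(x @ W_ff1, 0) @ W_ff2   # residual around feed-forward
    return x

x = rng.normal(size=(batch, seq_len, d_model))
W_ff1 = rng.normal(size=(d_model, 256)) * 0.01  # expand to hidden size
W_ff2 = rng.normal(size=(256, d_model)) * 0.01  # project back to d_model
y = encoder_layer(x, W_ff1, W_ff2)
print(y.shape)  # (4, 10, 64) — identical to the input shape
```

The feed-forward block temporarily expands the last dimension, but projects back to `d_model`, so the output tensor always matches the input shape.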
In a Transformer model, if the embedding dimension is 128, which choice of number of attention heads is valid and why?
Each attention head processes an equal-sized slice of the embedding dimension.
The embedding dimension must be divisible by the number of heads so each head gets an equal-sized slice. 128 divided by 8 is 16, which is valid.
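The divisibility rule can be checked directly; this small snippet tries a few candidate head counts against an embedding dimension of 128 (the candidate values are arbitrary examples).

```python
embedding_dim = 128

# A head count is valid only if it divides the embedding dimension evenly;
# head_dim is the slice of the embedding each head receives.
for num_heads in (6, 8, 12, 16):
    valid = embedding_dim % num_heads == 0
    head_dim = embedding_dim // num_heads if valid else None
    print(f"heads={num_heads:2d} valid={valid} head_dim={head_dim}")
# 8 and 16 divide 128 evenly (head_dim 16 and 8); 6 and 12 do not.
```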
Which metric is most appropriate to evaluate a Transformer model trained for a multi-class text classification task?
Think about a task where the model picks one class label from many possible classes.
Accuracy is appropriate for multi-class classification: it measures the fraction of predictions that exactly match the true class labels.
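As a minimal sketch, accuracy for a multi-class task reduces to comparing predicted and true label arrays; the labels below are made up for illustration.

```python
import numpy as np

# Hypothetical labels for a 4-class task (illustrative values only).
y_true = np.array([0, 2, 1, 3, 2, 0])
y_pred = np.array([0, 2, 2, 3, 2, 1])

# Accuracy = fraction of predictions matching the true labels.
accuracy = (y_pred == y_true).mean()
print(accuracy)  # 4 of 6 correct -> 0.666...
```

In practice `y_pred` would come from an argmax over the model's per-class logits, but the metric itself is just this element-wise comparison.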
A Transformer model training suddenly diverges with loss becoming NaN after a few epochs. Which of the following is the most likely cause?
Consider what causes gradients to become unstable during training.
A very high learning rate can cause gradients to explode: parameter updates overshoot, values overflow, and the loss becomes NaN, destabilizing training.
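The divergence mechanism can be shown on a toy problem. This sketch runs plain gradient descent on f(w) = w² with two learning rates (both values chosen only for illustration): a small one converges, while a too-large one makes each update overshoot so the parameter's magnitude grows without bound — and once values overflow in a real network, operations like inf - inf produce NaN losses.

```python
def train(lr, steps=200):
    # Gradient descent on f(w) = w^2, whose gradient is 2w.
    # Each update is w <- w * (1 - 2*lr), so lr > 1 flips the sign
    # and grows |w| every step instead of shrinking it.
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(train(0.1))   # shrinks toward 0: training converges
print(train(1.5))   # |w| doubles every step: training diverges
```

The same overshoot-and-grow dynamic in a deep network drives activations and gradients to overflow, which is why the loss turns NaN rather than merely plateauing.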