Recall & Review
beginner
What is the main role of the Transformer decoder in a sequence-to-sequence model?
The Transformer decoder generates the output sequence step-by-step by attending to the encoder's output and previously generated tokens, enabling tasks like translation or text generation.
beginner
Name the three main sub-layers inside a Transformer decoder block.
1. Masked multi-head self-attention
2. Multi-head cross-attention (attends to the encoder output)
3. Position-wise feed-forward network
intermediate
Why does the Transformer decoder use masked self-attention?
Masked self-attention prevents the decoder from 'seeing' future tokens during training, ensuring predictions are made only from past and current tokens, preserving the autoregressive property.
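The causal mask described above can be sketched in a few lines. This is a minimal illustration, assuming PyTorch is available; the helper name `causal_mask` and the sequence length are made up for the example.

```python
import torch

def causal_mask(sz: int) -> torch.Tensor:
    # Upper-triangular matrix with -inf above the diagonal:
    # position i may attend to positions 0..i (value 0.0),
    # while future positions are blocked (value -inf).
    return torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

mask = causal_mask(4)
# Row i is the query position; -inf entries are added to the
# attention scores, so softmax assigns them zero weight.
```

Adding `-inf` before the softmax (rather than zeroing weights afterwards) keeps the remaining attention weights correctly normalized.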
beginner
In PyTorch, which class is commonly used to implement a Transformer decoder layer?
torch.nn.TransformerDecoderLayer implements a single decoder layer, and torch.nn.TransformerDecoder stacks multiple such layers.
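These two classes can be wired together as follows. This is a minimal sketch, assuming PyTorch is installed; the model dimensions, head count, and sequence lengths are illustrative, not prescribed by the cards.

```python
import torch
import torch.nn as nn

d_model, nhead, num_layers = 64, 4, 2  # illustrative sizes

# One decoder block: masked self-attention, cross-attention, feed-forward.
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
# Stack num_layers copies of that block.
decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

tgt = torch.randn(1, 5, d_model)     # embedded target tokens so far
memory = torch.randn(1, 7, d_model)  # encoder output (for cross-attention)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(5)  # causal mask

out = decoder(tgt, memory, tgt_mask=tgt_mask)
# out has shape (1, 5, d_model): one representation per target position.
```

Note that the decoder consumes the encoder output via the `memory` argument, which is exactly the cross-attention sub-layer the next card describes.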
intermediate
What is the purpose of the cross-attention sub-layer in the Transformer decoder?
Cross-attention allows the decoder to focus on relevant parts of the encoder's output, helping it generate context-aware outputs based on the input sequence.
Which sub-layer in the Transformer decoder prevents the model from attending to future tokens?
Masked multi-head self-attention blocks future tokens to maintain autoregressive generation.
What does the cross-attention layer in the Transformer decoder attend to?
Cross-attention attends to encoder outputs to incorporate input context.
In PyTorch, which class stacks multiple Transformer decoder layers?
torch.nn.TransformerDecoder stacks multiple TransformerDecoderLayer instances.
Why is masking important in the Transformer decoder's self-attention?
Masking ensures the decoder only uses past and current tokens for prediction.
Which of these is NOT a sub-layer in a Transformer decoder block?
Transformer decoders do not use convolutional layers in their standard architecture.
Explain the architecture and function of a Transformer decoder layer.
Think about how the decoder processes previous outputs and encoder information step-by-step.
Describe how masking works in the Transformer decoder and why it is necessary.
Consider the order of token generation in language models.