
Transformer decoder in PyTorch - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the main role of the Transformer decoder in a sequence-to-sequence model?
The Transformer decoder generates the output sequence step-by-step by attending to the encoder's output and previously generated tokens, enabling tasks like translation or text generation.
beginner
Name the three main sub-layers inside a Transformer decoder block.
1. Masked multi-head self-attention
2. Multi-head cross-attention (attends to the encoder output)
3. Position-wise feed-forward network
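The three sub-layers above can be sketched as a single decoder block. This is a minimal illustration using `torch.nn.MultiheadAttention` with residual connections and layer norm; the dimensions (`d_model=64`, `nhead=4`) are arbitrary choices for the example, not prescribed values.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One Transformer decoder block: masked self-attention,
    cross-attention, and a position-wise feed-forward network."""
    def __init__(self, d_model=64, nhead=4, dim_ff=128):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt, memory, tgt_mask=None):
        # 1. Masked multi-head self-attention over the decoder's own tokens
        x = self.norm1(tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        # 2. Cross-attention: queries from the decoder, keys/values from the encoder
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])
        # 3. Position-wise feed-forward network
        return self.norm3(x + self.ff(x))

tgt = torch.randn(2, 5, 64)     # (batch, target length, d_model)
memory = torch.randn(2, 7, 64)  # encoder output
out = DecoderBlock()(tgt, memory)
print(out.shape)  # torch.Size([2, 5, 64])
```

Note this sketch uses post-norm (norm after the residual add), as in the original Transformer paper; many modern implementations use pre-norm instead.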
intermediate
Why does the Transformer decoder use masked self-attention?
Masked self-attention prevents the decoder from 'seeing' future tokens during training, ensuring predictions are made only from past and current tokens, preserving the autoregressive property.
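In PyTorch the causal mask is typically an upper-triangular matrix of `-inf` values that is added to the attention scores, so softmax assigns zero weight to future positions. A quick sketch using the built-in helper:

```python
import torch
import torch.nn as nn

seq_len = 4
# Positions above the diagonal (future tokens) get -inf, the rest 0.
mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])

# Equivalent construction by hand:
manual = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
```

Row i of the mask governs what token i may attend to: only columns j <= i are unmasked, which is exactly the autoregressive constraint.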
beginner
In PyTorch, which class is commonly used to implement a Transformer decoder layer?
torch.nn.TransformerDecoderLayer is used to build a single decoder layer, and torch.nn.TransformerDecoder stacks multiple such layers.
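A minimal usage sketch of these two classes (hyperparameters here are arbitrary example values; `batch_first=True` assumes a reasonably recent PyTorch):

```python
import torch
import torch.nn as nn

d_model = 32
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)  # stack 6 copies of the layer

tgt = torch.randn(2, 5, d_model)     # decoder input (batch, tgt_len, d_model)
memory = torch.randn(2, 7, d_model)  # encoder output (batch, src_len, d_model)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(5)  # causal mask

out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 5, 32])
```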
intermediate
What is the purpose of the cross-attention sub-layer in the Transformer decoder?
Cross-attention allows the decoder to focus on relevant parts of the encoder's output, helping it generate context-aware outputs based on the input sequence.
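Cross-attention is just multi-head attention where the query comes from the decoder and the keys/values come from the encoder output. A small sketch (dimensions are illustrative):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
dec = torch.randn(1, 3, 16)  # decoder states -> queries
enc = torch.randn(1, 6, 16)  # encoder output -> keys and values
out, weights = attn(query=dec, key=enc, value=enc)
print(out.shape, weights.shape)  # torch.Size([1, 3, 16]) torch.Size([1, 3, 6])
```

The weight matrix has shape (batch, target length, source length): one attention distribution over the 6 encoder positions for each of the 3 decoder positions.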
Which sub-layer in the Transformer decoder prevents the model from attending to future tokens?
A. Feed-forward network
B. Cross-attention
C. Masked multi-head self-attention
D. Layer normalization
Answer: C
What does the cross-attention layer in the Transformer decoder attend to?
A. Encoder outputs
B. Future decoder tokens
C. Previous decoder outputs only
D. Input embeddings
Answer: A
In PyTorch, which class stacks multiple Transformer decoder layers?
A. torch.nn.TransformerEncoder
B. torch.nn.MultiheadAttention
C. torch.nn.TransformerDecoderLayer
D. torch.nn.TransformerDecoder
Answer: D
Why is masking important in the Transformer decoder's self-attention?
A. To prevent attending to future tokens
B. To speed up training
C. To reduce model size
D. To normalize inputs
Answer: A
Which of these is NOT a sub-layer in a Transformer decoder block?
A. Masked multi-head self-attention
B. Convolutional layer
C. Position-wise feed-forward network
D. Multi-head cross-attention
Answer: B
Explain the architecture and function of a Transformer decoder layer.
Think about how the decoder processes previous outputs and encoder information step-by-step.
Describe how masking works in the Transformer decoder and why it is necessary.
Consider the order of token generation in language models.