
Transformer decoder in PyTorch - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the main role of the Transformer decoder in a sequence-to-sequence model?
The Transformer decoder generates the output sequence step-by-step by attending to the encoder's output and previously generated tokens, enabling tasks like translation or text generation.
beginner
Name the three main sub-layers inside a Transformer decoder block.
1. Masked multi-head self-attention
2. Multi-head cross-attention (attends to the encoder output)
3. Position-wise feed-forward network
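The three sub-layers above can be sketched as a single decoder block. This is a minimal illustration using `torch.nn.MultiheadAttention` with residual connections and layer norm; the dimensions (`d_model=64`, `nhead=4`) are arbitrary choices for the example, not prescribed values.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One Transformer decoder block: masked self-attention,
    cross-attention, and a position-wise feed-forward network."""
    def __init__(self, d_model=64, nhead=4, dim_ff=128):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt, memory, tgt_mask=None):
        # 1. Masked multi-head self-attention over the decoder's own tokens
        x = self.norm1(tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        # 2. Cross-attention: queries from the decoder, keys/values from the encoder
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])
        # 3. Position-wise feed-forward network
        return self.norm3(x + self.ff(x))

tgt = torch.randn(2, 5, 64)     # (batch, target length, d_model)
memory = torch.randn(2, 7, 64)  # encoder output
out = DecoderBlock()(tgt, memory)
print(out.shape)  # torch.Size([2, 5, 64])
```

Note this sketch uses post-norm (norm after the residual add), as in the original Transformer paper; many modern implementations use pre-norm instead.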
intermediate
Why does the Transformer decoder use masked self-attention?
Masked self-attention prevents the decoder from 'seeing' future tokens during training, ensuring predictions are made only from past and current tokens, preserving the autoregressive property.
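In PyTorch the causal mask is typically an upper-triangular matrix of `-inf` values that is added to the attention scores, so softmax assigns zero weight to future positions. A quick sketch using the built-in helper:

```python
import torch
import torch.nn as nn

seq_len = 4
# Positions above the diagonal (future tokens) get -inf, the rest 0.
mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])

# Equivalent construction by hand:
manual = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
```

Row i of the mask governs what token i may attend to: only columns j <= i are unmasked, which is exactly the autoregressive constraint.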
beginner
In PyTorch, which class is commonly used to implement a Transformer decoder layer?
torch.nn.TransformerDecoderLayer is used to build a single decoder layer, and torch.nn.TransformerDecoder stacks multiple such layers.
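A minimal usage sketch of these two classes (hyperparameters here are arbitrary example values; `batch_first=True` assumes a reasonably recent PyTorch):

```python
import torch
import torch.nn as nn

d_model = 32
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)  # stack 6 copies of the layer

tgt = torch.randn(2, 5, d_model)     # decoder input (batch, tgt_len, d_model)
memory = torch.randn(2, 7, d_model)  # encoder output (batch, src_len, d_model)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(5)  # causal mask

out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 5, 32])
```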
intermediate
What is the purpose of the cross-attention sub-layer in the Transformer decoder?
Cross-attention allows the decoder to focus on relevant parts of the encoder's output, helping it generate context-aware outputs based on the input sequence.
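Cross-attention is just multi-head attention where the query comes from the decoder and the keys/values come from the encoder output. A small sketch (dimensions are illustrative):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
dec = torch.randn(1, 3, 16)  # decoder states -> queries
enc = torch.randn(1, 6, 16)  # encoder output -> keys and values
out, weights = attn(query=dec, key=enc, value=enc)
print(out.shape, weights.shape)  # torch.Size([1, 3, 16]) torch.Size([1, 3, 6])
```

The weight matrix has shape (batch, target length, source length): one attention distribution over the 6 encoder positions for each of the 3 decoder positions.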
Which sub-layer in the Transformer decoder prevents the model from attending to future tokens?
A. Feed-forward network
B. Cross-attention
C. Masked multi-head self-attention
D. Layer normalization
Answer: C
What does the cross-attention layer in the Transformer decoder attend to?
A. Encoder outputs
B. Future decoder tokens
C. Previous decoder outputs only
D. Input embeddings
Answer: A
In PyTorch, which class stacks multiple Transformer decoder layers?
A. torch.nn.TransformerEncoder
B. torch.nn.MultiheadAttention
C. torch.nn.TransformerDecoderLayer
D. torch.nn.TransformerDecoder
Answer: D
Why is masking important in the Transformer decoder's self-attention?
A. To prevent attending to future tokens
B. To speed up training
C. To reduce model size
D. To normalize inputs
Answer: A
Which of these is NOT a sub-layer in a Transformer decoder block?
A. Masked multi-head self-attention
B. Convolutional layer
C. Position-wise feed-forward network
D. Multi-head cross-attention
Answer: B
Explain the architecture and function of a Transformer decoder layer.
Think about how the decoder processes previous outputs and encoder information step-by-step.
Describe how masking works in the Transformer decoder and why it is necessary.
Consider the order of token generation in language models.