Practice


1. What is the main purpose of the attention mechanism in a Transformer model?

easy

A. To increase the size of the model

B. To focus on important parts of the input data

C. To reduce the number of layers

D. To store data permanently

Solution

Step 1: Understand attention mechanism role
The attention mechanism helps the model decide which parts of the input are important to focus on for better understanding.
Step 2: Compare options with attention purpose
Only option B, "To focus on important parts of the input data", describes this role. The other options describe unrelated functions: model size, layer count, and data storage.
Final Answer:
To focus on important parts of the input data -> Option B
Quick Check:
Attention = Focus on important parts [OK]

Hint: Attention means focusing on key input parts [OK]

Common Mistakes:

Thinking attention increases model size
Confusing attention with data storage
Assuming attention reduces layers
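
To make "focusing on important parts" concrete, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. The vectors and the helper names `softmax` and `attend` are illustrative assumptions, not part of the quiz; a real Transformer applies learned projections and batches this over many queries and heads.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Score each key by its scaled dot product with the query.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Turn scores into weights that sum to 1: the "focus" over input positions.
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

# Toy data (hypothetical): the third key is most similar to the query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 1.0]

weights, output = attend(query, keys, values)
print(weights)
print(output)
```

The third position receives the largest weight because its key has the highest dot product with the query, which is the sense in which attention "focuses" on part of the input.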

2. Which of the following is the correct order of components inside a Transformer encoder layer?

easy

A. Multi-head attention -> Feed-forward network -> Layer normalization

B. Feed-forward network -> Multi-head attention -> Layer normalization

C. Multi-head attention -> Layer normalization -> Feed-forward network

D. Layer normalization -> Multi-head attention -> Feed-forward network

Solution

Step 1: Recall Transformer encoder layer structure
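The encoder layer first computes multi-head self-attention, adds the residual connection, and applies layer normalization. It then runs the feed-forward network, followed by another residual connection and layer normalization.
Step 2: Match the correct sequence
Reduced to one occurrence of each component, the order is Multi-head attention -> Feed-forward network -> Layer normalization. Note that this is a simplification: in the full layer described in Step 1, a residual connection and layer normalization follow each sub-layer, so normalization also sits between attention and the feed-forward network.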
Final Answer:
Multi-head attention -> Feed-forward network -> Layer normalization -> Option A
Quick Check:
Encoder order = Attn -> FFN -> Norm [OK]

Hint: Encoder: attn -> feed-forward -> norm [OK]

Common Mistakes:

Mixing up the order of feed-forward and attention
Placing layer normalization incorrectly
Assuming normalization comes first
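
The full ordering from Step 1 (attention, add and normalize, feed-forward, add and normalize) can be sketched in PyTorch as follows. This is an illustrative post-norm layer under assumed sizes; the class name `PostNormEncoderLayer` and the `ff_dim` value are hypothetical, and dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

class PostNormEncoderLayer(nn.Module):
    """Sketch of one encoder layer: attn -> add & norm -> FFN -> add & norm."""

    def __init__(self, embed_dim=8, num_heads=2, ff_dim=16):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Sub-layer 1: self-attention, then residual connection and layer norm.
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_output)
        # Sub-layer 2: feed-forward network, then residual connection and layer norm.
        x = self.norm2(x + self.feed_forward(x))
        return x

x = torch.rand(5, 3, 8)  # (seq_len, batch, embed_dim)
layer = PostNormEncoderLayer()
out = layer(x)
print(out.shape)
```

The residual connections keep the output shape identical to the input shape, which is why such layers can be stacked.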

3. Given a Transformer decoder layer with masked multi-head attention, what is the main reason for masking?

medium

A. To prevent the model from seeing future tokens during training

B. To speed up the training process

C. To increase the number of attention heads

D. To reduce the model size

Solution

Step 1: Understand masking in decoder attention
Masking hides future tokens so the model predicts the next word without cheating by looking ahead.
Step 2: Evaluate options against masking purpose
Only option A, "To prevent the model from seeing future tokens during training", describes this purpose. Masking does not speed up training, add attention heads, or shrink the model.
Final Answer:
To prevent the model from seeing future tokens during training -> Option A
Quick Check:
Masking = Hide future tokens [OK]

Hint: Masking hides future words in decoder [OK]

Common Mistakes:

Thinking masking speeds training
Confusing masking with model size reduction
Assuming masking adds attention heads
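
A small plain-Python sketch shows what the mask does to the attention weights. The helper names `causal_mask` and `masked_softmax` and the all-zero scores are assumptions chosen for illustration: with equal scores, each position simply spreads its attention evenly over itself and earlier positions.

```python
import math

def causal_mask(n):
    # Position i may attend to position j only if j <= i (no future tokens).
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    # Disallowed positions get -inf, so they receive exactly zero weight.
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
scores = [[0.0] * n for _ in range(n)]  # toy scores: every pair equally similar
mask = causal_mask(n)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]

for row in weights:
    print(row)
```

Every entry above the diagonal is zero, so the prediction at position i cannot depend on tokens that come after it.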

4. Consider this simplified Transformer encoder code snippet in Python:

import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim=8, num_heads=2)
    def forward(self, x):
        attn_output, _ = self.attention(x, x, x)
        return attn_output

x = torch.rand(5, 3, 8)  # sequence length=5, batch=3, embed=8
model = SimpleEncoder()
output = model(x)
print(output.shape)

What, if anything, is wrong with this code?

medium

A. MultiheadAttention requires input shape (sequence length, batch size, embedding), but x is (batch size, sequence length, embedding)

B. Input shape to MultiheadAttention should be (batch size, sequence length, embedding), but x is (5, 3, 8)

C. MultiheadAttention requires input shape (sequence length, batch size, embedding), but x is (5, 3, 8)

D. Input shape to MultiheadAttention should be (sequence length, batch size, embedding), but x is (5, 3, 8) which is correct

Solution

Step 1: Check expected input shape for nn.MultiheadAttention
By default (batch_first=False), PyTorch's nn.MultiheadAttention expects input of shape (sequence length, batch size, embedding dimension).
Step 2: Verify input tensor shape
The tensor x has shape (5, 3, 8), which matches (sequence length=5, batch size=3, embedding=8). The code therefore runs without error and prints torch.Size([5, 3, 8]).
Final Answer:
Input shape to MultiheadAttention should be (sequence length, batch size, embedding), but x is (5, 3, 8) which is correct -> Option D
Quick Check:
Input shape = (seq_len, batch, embed) [OK]

Hint: MultiheadAttention input shape is (seq_len, batch, embed) [OK]

Common Mistakes:

Confusing batch and sequence length order
Assuming batch size is first dimension
Mixing embedding dimension position
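
The shape convention depends on the `batch_first` argument of `nn.MultiheadAttention`. The sketch below contrasts the default layout with `batch_first=True`; the variable names are illustrative, and the sizes are taken from the snippet in the question.

```python
import torch
import torch.nn as nn

# Default layout (batch_first=False): inputs are (seq_len, batch, embed_dim).
seq_first = nn.MultiheadAttention(embed_dim=8, num_heads=2)
x_seq_first = torch.rand(5, 3, 8)
out_seq_first, weights = seq_first(x_seq_first, x_seq_first, x_seq_first)
print(out_seq_first.shape)  # same layout as the input
print(weights.shape)        # (batch, target_len, source_len), averaged over heads

# With batch_first=True, inputs are (batch, seq_len, embed_dim) instead.
batch_first = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x_batch_first = x_seq_first.transpose(0, 1)  # now (3, 5, 8)
out_batch_first, _ = batch_first(x_batch_first, x_batch_first, x_batch_first)
print(out_batch_first.shape)
```

Passing a batch-first tensor to a module built with the default setting does not raise an error, since the shapes are still valid; the module just treats the batch axis as the sequence axis, which is why this mistake is easy to miss.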

5. You want to build a Transformer model for translating short sentences. Which combination of components is essential in the Transformer architecture to handle this task effectively?

hard

A. Feed-forward networks only without attention

B. Only encoder layers with feed-forward networks

C. Encoder with self-attention, decoder with masked self-attention, and cross-attention layers

D. Decoder layers without attention mechanisms

Solution

Step 1: Identify components needed for translation
Translation requires understanding input (encoder) and generating output (decoder) with attention to both input and generated tokens.
Step 2: Match components to translation needs
Option C covers all three: encoder self-attention to represent the input sentence, decoder masked self-attention to prevent peeking at future output tokens, and cross-attention to let the decoder attend to the encoder's output. The other options drop attention, the decoder, or both.
Final Answer:
Encoder with self-attention, decoder with masked self-attention, and cross-attention layers -> Option C
Quick Check:
Translation needs encoder, decoder, and cross-attention [OK]

Hint: Translation needs encoder, decoder, and cross-attention [OK]

Common Mistakes:

Ignoring decoder or cross-attention layers
Using only feed-forward networks
Skipping masking in decoder
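
As a rough illustration of how these pieces fit together, PyTorch's `nn.Transformer` bundles encoder self-attention, decoder masked self-attention, and cross-attention. The sketch below assumes tiny dimensions and random tensors in place of real sentence embeddings; an actual translation model would also need token embeddings, positional encodings, and an output projection over the vocabulary.

```python
import torch
import torch.nn as nn

# Tiny encoder-decoder Transformer (sizes are arbitrary, for illustration only).
model = nn.Transformer(
    d_model=8,
    nhead=2,
    num_encoder_layers=1,
    num_decoder_layers=1,
    dim_feedforward=16,
)

src = torch.rand(6, 3, 8)  # source sentence: (src_len, batch, d_model)
tgt = torch.rand(4, 3, 8)  # target so far:   (tgt_len, batch, d_model)

# Causal mask for the decoder: -inf above the diagonal blocks future positions.
tgt_len = tgt.size(0)
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

# The decoder attends to its own past tokens (masked) and to the encoder output.
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)
```

The output has one vector per target position, so its length follows the target sequence rather than the source sequence.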

Epoch	Loss ↓	Accuracy ↑	Observation
1	5.2	0.12	High loss and low accuracy at start
2	3.8	0.35	Loss decreased, accuracy improved
3	2.7	0.52	Model learning meaningful patterns
4	1.9	0.68	Good progress, loss dropping steadily
5	1.3	0.78	Model converging with better accuracy
6	0.9	0.85	Loss low, accuracy high, training stable

Transformer architecture overview in Prompt Engineering / GenAI - Model Pipeline Trace
