Transformer architecture overview in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
What is the main purpose of the self-attention mechanism in a Transformer?

The Transformer model uses a self-attention mechanism. What does this mechanism mainly do?

A. It helps the model focus on different parts of the input sequence to understand context.
B. It reduces the size of the input data by compressing it into a smaller vector.
C. It generates random noise to improve model robustness.
D. It sorts the input tokens in order of importance before processing.
💡 Hint

Think about how the model understands relationships between words in a sentence.
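
To make the idea concrete, here is a minimal single-head sketch of scaled dot-product self-attention. The function name `self_attention` and the random projection matrices `w_q`, `w_k`, `w_v` are illustrative assumptions, not part of the question; a real Transformer learns these projections and uses several heads.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention for x of shape (seq_len, d_model)."""
    q = x @ w_q  # queries: what each token is looking for
    k = x @ w_k  # keys: what each token offers
    v = x @ w_v  # values: the content that gets mixed
    # Similarity of every token with every other token, scaled by sqrt(d_k)
    scores = q @ k.T / math.sqrt(k.shape[-1])
    # Each row becomes a probability distribution over the input positions
    weights = torch.softmax(scores, dim=-1)
    # Each output row is a weighted mix of all value vectors
    return weights @ v, weights

torch.manual_seed(0)
seq_len, d_model = 4, 8
x = torch.rand(seq_len, d_model)
w_q, w_k, w_v = (torch.rand(d_model, d_model) for _ in range(3))

context, weights = self_attention(x, w_q, w_k, w_v)
print(context.shape)         # torch.Size([4, 8])
print(weights.sum(dim=-1))   # every row sums to 1
```

Each row of `weights` shows how strongly one token attends to every token in the sequence, which is how the model picks up context.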

Predict Output
intermediate
Output shape after multi-head attention layer

Given an input tensor of shape (batch_size=2, seq_len=5, embedding_dim=64) passed through a multi-head attention layer with 8 heads and output dimension 64, what is the shape of the output tensor?

import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
embedding_dim = 64
num_heads = 8

x = torch.rand(batch_size, seq_len, embedding_dim)
mha = nn.MultiheadAttention(embed_dim=embedding_dim, num_heads=num_heads, batch_first=True)
output, _ = mha(x, x, x)
print(output.shape)
A. (2, 8, 8)
B. (5, 2, 64)
C. (2, 5, 512)
D. (2, 5, 64)
💡 Hint

Remember the output shape matches the input sequence length and embedding dimension.
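
As a small follow-up sketch, the second return value of `nn.MultiheadAttention` can also be inspected. This assumes a PyTorch version that supports the `average_attn_weights` argument (1.11 or later); the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(2, 5, 64)  # (batch, seq_len, embed_dim)
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

# By default the attention weights are averaged over the heads
output, avg_weights = mha(x, x, x)
# Ask for one weight matrix per head instead
_, per_head_weights = mha(x, x, x, average_attn_weights=False)

print(output.shape)            # torch.Size([2, 5, 64])
print(avg_weights.shape)       # torch.Size([2, 5, 5])
print(per_head_weights.shape)  # torch.Size([2, 8, 5, 5])
```

The heads are concatenated and projected back to `embed_dim`, so the number of heads shows up in the per-head weights but not in the output shape.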

Hyperparameter
advanced
Choosing the number of attention heads in a Transformer

Why might increasing the number of attention heads in a Transformer model improve performance?

A. Because more heads increase the embedding dimension automatically without extra computation.
B. Because more heads reduce the total number of parameters, making training faster.
C. Because more heads allow the model to attend to information from different representation subspaces at different positions.
D. Because more heads guarantee the model will not overfit on training data.
💡 Hint

Think about how multiple heads help the model see different aspects of the input.
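
A quick sketch of why options A and B do not hold for PyTorch's `nn.MultiheadAttention`: the embedding dimension is split across heads, so the parameter count stays the same while each head works in a smaller subspace. The dictionary `param_counts` is just an illustrative helper.

```python
import torch.nn as nn

embed_dim = 64
param_counts = {}
for num_heads in (1, 4, 8):
    mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
    # head_dim = embed_dim // num_heads: each head sees a slice of the embedding
    param_counts[num_heads] = sum(p.numel() for p in mha.parameters())
    print(num_heads, mha.head_dim, param_counts[num_heads])
```

All three configurations report the same number of parameters; only the per-head dimension changes (64, 16, and 8).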

Metrics
advanced
Interpreting Transformer training loss curves

During training of a Transformer model, the training loss decreases steadily but the validation loss starts increasing after some epochs. What does this indicate?

A. The model is underfitting and needs more training epochs.
B. The model is overfitting the training data and not generalizing well to new data.
C. The learning rate is too low and should be increased.
D. The batch size is too large, causing unstable training.
💡 Hint

Think about what it means when validation loss worsens but training loss improves.
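
One common response to this pattern is early stopping. The sketch below is a plain-Python illustration; the function name `best_epoch_with_patience` and the loss values are made up for the example and do not come from a real training run.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (stop_epoch, best_epoch).

    Training stops at the first epoch where the validation loss has not
    improved for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical curves: training loss keeps falling, validation loss turns upward
train_losses = [2.1, 1.6, 1.2, 0.9, 0.7, 0.5, 0.4]
val_losses = [2.2, 1.8, 1.5, 1.4, 1.5, 1.7, 1.9]

stop_epoch, best_epoch = best_epoch_with_patience(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 5 3
```

In practice the checkpoint from `best_epoch` would be the one to keep, since later epochs only fit the training data more closely.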

🔧 Debug
expert
Identifying cause of NaN values in Transformer training

While training a Transformer model, the loss suddenly becomes NaN after a few epochs. Which of the following is the most likely cause?

A. The learning rate is too high, causing unstable gradients and exploding values.
B. The batch size is too small, causing insufficient gradient updates.
C. The model has too few layers, limiting its capacity.
D. The input data is normalized, which causes NaN values.
💡 Hint

Consider what can cause gradients or loss to become infinite or undefined.
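
Two common safeguards against this failure are a finite-loss check and gradient clipping. The sketch below uses a small `nn.Linear` as a stand-in for a Transformer, with random data; the learning rate and `max_norm=1.0` are illustrative assumptions, not recommended settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)  # stand-in for a Transformer, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.rand(16, 8), torch.rand(16, 1)

loss = nn.functional.mse_loss(model(x), y)
# Fail fast instead of silently propagating NaN or inf into the weights
if not torch.isfinite(loss):
    raise RuntimeError("non-finite loss; consider lowering the learning rate")

optimizer.zero_grad()
loss.backward()
# Rescales gradients in place so their total norm is at most max_norm;
# the return value is the total norm measured before clipping
grad_norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
grad_norm_after = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
optimizer.step()

print(float(grad_norm_before), float(grad_norm_after))
```

Clipping limits the size of a single update, but it does not replace choosing a learning rate (and warmup schedule) that keeps training stable.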

Practice

1. What is the main purpose of the attention mechanism in a Transformer model?
easy
A. To increase the size of the model
B. To focus on important parts of the input data
C. To reduce the number of layers
D. To store data permanently

Solution

  1. Step 1: Understand attention mechanism role

    The attention mechanism helps the model decide which parts of the input are important to focus on for better understanding.
  2. Step 2: Compare options with attention purpose

    Only option B, "To focus on important parts of the input data", describes this role; the other options describe unrelated functions.
  3. Final Answer:

    To focus on important parts of the input data -> Option B
  4. Quick Check:

    Attention = Focus on important parts [OK]
Hint: Attention means focusing on key input parts [OK]
Common Mistakes:
  • Thinking attention increases model size
  • Confusing attention with data storage
  • Assuming attention reduces layers
2. Which of the following is the correct order of components inside a Transformer encoder layer?
easy
A. Multi-head attention -> Feed-forward network -> Layer normalization
B. Feed-forward network -> Multi-head attention -> Layer normalization
C. Multi-head attention -> Layer normalization -> Feed-forward network
D. Layer normalization -> Multi-head attention -> Feed-forward network

Solution

  1. Step 1: Recall Transformer encoder layer structure

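    The original (post-LN) encoder layer computes the multi-head attention output, adds the residual, and applies layer normalization; it then runs the feed-forward network, followed by another residual addition and layer normalization.
  2. Step 2: Match the correct sequence

    Among the options, only Multi-head attention -> Layer normalization -> Feed-forward network places a normalization step between the attention and feed-forward sub-layers, as Step 1 describes (a residual connection and layer normalization follow each sub-layer). Pre-LN variants used in many later models normalize before each sub-layer instead; this question refers to the original design.
  3. Final Answer:

    Multi-head attention -> Layer normalization -> Feed-forward network -> Option C
  4. Quick Check:

    Encoder order = Attn -> Add & Norm -> FFN -> Add & Norm [OK]
Hint: Encoder: attn -> norm -> feed-forward -> norm [OK]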
Common Mistakes:
  • Mixing up the order of feed-forward and attention
  • Placing layer normalization incorrectly
  • Assuming normalization comes first
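
The sequence described in Step 1 of the solution above (attention, residual add, layer norm, feed-forward, residual add, layer norm) can be sketched as a small module. The class name `PostLNEncoderLayer` and the sizes are illustrative assumptions; dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

class PostLNEncoderLayer(nn.Module):
    """Minimal post-LN encoder layer: each sub-layer is followed by add & norm."""

    def __init__(self, d_model=64, num_heads=8, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: self-attention, then residual add and layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: feed-forward, then residual add and layer norm
        x = self.norm2(x + self.ffn(x))
        return x

layer = PostLNEncoderLayer()
out = layer(torch.rand(2, 5, 64))  # (batch, seq_len, d_model)
print(out.shape)  # torch.Size([2, 5, 64])
```

The layer preserves the input shape, which is what allows several encoder layers to be stacked.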
3. Given a Transformer decoder layer with masked multi-head attention, what is the main reason for masking?
medium
A. To prevent the model from seeing future tokens during training
B. To speed up the training process
C. To increase the number of attention heads
D. To reduce the model size

Solution

  1. Step 1: Understand masking in decoder attention

    Masking hides future tokens so the model predicts the next word without cheating by looking ahead.
  2. Step 2: Evaluate options against masking purpose

    Only option A, "To prevent the model from seeing future tokens during training", matches this purpose; the other options describe effects that masking does not have.
  3. Final Answer:

    To prevent the model from seeing future tokens during training -> Option A
  4. Quick Check:

    Masking = Hide future tokens [OK]
Hint: Masking hides future words in decoder [OK]
Common Mistakes:
  • Thinking masking speeds training
  • Confusing masking with model size reduction
  • Assuming masking adds attention heads
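
A causal mask can be sketched with `torch.triu`. In the boolean form used here, `True` marks positions that may not be attended to; the tensor sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, embed_dim = 4, 8
x = torch.rand(1, seq_len, embed_dim)  # (batch, seq_len, embed_dim)
mha = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)

# Everything above the diagonal is a future position and gets masked out
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
_, weights = mha(x, x, x, attn_mask=causal_mask)

print(causal_mask)
print(weights[0])  # row i has non-zero weights only for positions 0..i
```

Position 0 can only attend to itself, position 1 to positions 0 and 1, and so on, so no token receives information from tokens that come after it.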
4. Consider this simplified Transformer encoder code snippet in Python:
import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim=8, num_heads=2)
    def forward(self, x):
        attn_output, _ = self.attention(x, x, x)
        return attn_output

x = torch.rand(5, 3, 8)  # sequence length=5, batch=3, embed=8
model = SimpleEncoder()
output = model(x)
print(output.shape)
Which statement about this code is accurate?
medium
A. MultiheadAttention requires input shape (sequence length, batch size, embedding), but x is (batch size, sequence length, embedding)
B. Input shape to MultiheadAttention should be (batch size, sequence length, embedding), but x is (5, 3, 8)
C. MultiheadAttention requires input shape (sequence length, batch size, embedding), but x is (5, 3, 8), which does not match
D. Input shape to MultiheadAttention should be (sequence length, batch size, embedding), and x is (5, 3, 8), which is correct

Solution

  1. Step 1: Check expected input shape for nn.MultiheadAttention

    With the default batch_first=False, PyTorch's nn.MultiheadAttention expects input shape (sequence length, batch size, embedding dimension).
  2. Step 2: Verify input tensor shape

    The tensor x has shape (5, 3, 8), which matches (sequence length=5, batch size=3, embedding=8), so the code runs without error and prints torch.Size([5, 3, 8]).
  3. Final Answer:

    Input shape to MultiheadAttention should be (sequence length, batch size, embedding), and x is (5, 3, 8), which is correct -> Option D
  4. Quick Check:

    Default input shape = (seq_len, batch, embed) [OK]
Hint: Default MultiheadAttention input shape is (seq_len, batch, embed) [OK]
Common Mistakes:
  • Confusing batch and sequence length order
  • Assuming batch size is first dimension
  • Mixing embedding dimension position
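
The layout depends on the `batch_first` flag, which the snippet in this question leaves at its default. The sketch below feeds the same tensor to both settings; the shape of the returned attention weights reveals how each module interpreted the axes. Variable names are illustrative.

```python
import torch
import torch.nn as nn

seq_first = nn.MultiheadAttention(embed_dim=8, num_heads=2)  # default batch_first=False
batch_first = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.rand(5, 3, 8)

out_a, w_a = seq_first(x, x, x)    # read as seq_len=5, batch=3
out_b, w_b = batch_first(x, x, x)  # read as batch=5, seq_len=3

print(out_a.shape, w_a.shape)  # torch.Size([5, 3, 8]) torch.Size([3, 5, 5])
print(out_b.shape, w_b.shape)  # torch.Size([5, 3, 8]) torch.Size([5, 3, 3])
```

Both calls run and return an output of the same shape as the input, so a wrong layout does not raise an error; it silently attends across the wrong axis.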
5. You want to build a Transformer model for translating short sentences. Which combination of components is essential in the Transformer architecture to handle this task effectively?
hard
A. Feed-forward networks only without attention
B. Only encoder layers with feed-forward networks
C. Encoder with self-attention, decoder with masked self-attention, and cross-attention layers
D. Decoder layers without attention mechanisms

Solution

  1. Step 1: Identify components needed for translation

    Translation requires understanding input (encoder) and generating output (decoder) with attention to both input and generated tokens.
  2. Step 2: Match components to translation needs

    Option C combines all three pieces: encoder self-attention to understand the source sentence, masked self-attention in the decoder to prevent peeking at future tokens, and cross-attention so the decoder can read the encoder's output. All three are essential for translation.
  3. Final Answer:

    Encoder with self-attention, decoder with masked self-attention, and cross-attention layers -> Option C
  4. Quick Check:

    Translation needs encoder, decoder, and cross-attention [OK]
Hint: Translation needs encoder, decoder, and cross-attention [OK]
Common Mistakes:
  • Ignoring decoder or cross-attention layers
  • Using only feed-forward networks
  • Skipping masking in decoder
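
PyTorch's `nn.Transformer` bundles these three components. The sketch below uses small, arbitrary sizes and random tensors in place of embedded source and target sentences; a real translation model would add token embeddings, positional encodings, and an output projection onto the vocabulary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 32
model = nn.Transformer(
    d_model=d_model,
    nhead=4,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    batch_first=True,
)

src = torch.rand(2, 7, d_model)  # stand-in for 2 embedded source sentences of 7 tokens
tgt = torch.rand(2, 5, d_model)  # stand-in for 2 embedded target prefixes of 5 tokens

# Causal mask so each target position only attends to earlier target positions
tgt_mask = model.generate_square_subsequent_mask(5)
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 5, 32])
```

The output has one vector per target position; the decoder produced it using masked self-attention over `tgt` and cross-attention over the encoded `src`.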