
Transformer architecture overview in Prompt Engineering / GenAI - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
1. Fill in the blank
easy

Complete the code to create the input embedding layer for a Transformer model.

embedding_layer = nn.Embedding(num_tokens, [1])
A. embedding_dim
B. num_heads
C. num_layers
D. dropout_rate
Common Mistakes:
  • Using the number of heads instead of the embedding dimension
  • Confusing the number of layers with the embedding size
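To see why the second argument is the embedding dimension, here is a minimal sketch of the completed line in use. The vocabulary size and embedding width are hypothetical values chosen for illustration, not taken from the exercise.

```python
import torch
import torch.nn as nn

num_tokens = 1000    # hypothetical vocabulary size
embedding_dim = 16   # hypothetical width of each token vector

# nn.Embedding(num_embeddings, embedding_dim): one learned vector per token id.
embedding_layer = nn.Embedding(num_tokens, embedding_dim)

# A batch of 2 sequences, each holding 5 token ids.
token_ids = torch.randint(0, num_tokens, (2, 5))
embedded = embedding_layer(token_ids)

print(embedded.shape)  # torch.Size([2, 5, 16])
```

Each token id is replaced by a vector of length embedding_dim, so the output gains one trailing dimension.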
2. Fill in the blank
medium

Complete the code to apply multi-head attention in the Transformer encoder block.

attention_output, _ = multihead_attn(query, key, value, [1]=key_padding_mask)
A. bias
B. dropout
C. attn_mask
D. key_padding_mask
Common Mistakes:
  • Using attn_mask instead of key_padding_mask to hide padding
  • Passing a dropout parameter here instead of a mask
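The following sketch shows what key_padding_mask does in practice, assuming PyTorch's nn.MultiheadAttention with batch_first=True. The tensor sizes and the padding pattern are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads = 8, 2  # hypothetical sizes
multihead_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.rand(2, 4, embed_dim)  # (batch, seq_len, embed_dim)

# True marks a padded key position that attention should ignore.
key_padding_mask = torch.tensor([
    [False, False, False, True],   # sequence 1: last token is padding
    [False, False, True, True],    # sequence 2: last two tokens are padding
])

attention_output, attention_weights = multihead_attn(
    x, x, x, key_padding_mask=key_padding_mask
)

print(attention_output.shape)      # torch.Size([2, 4, 8])
print(attention_weights[0, :, 3])  # weights on the padded key are all zero
```

The mask has shape (batch, seq_len), one flag per key, which is what distinguishes it from attn_mask, whose shape is (seq_len, seq_len).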
3. Fill in the blank
hard

Fix the error in the Transformer feed-forward network layer by completing the missing activation function.

ffn_output = linear2([1](linear1(x)))
A. sigmoid
B. relu
C. softmax
D. tanh
Common Mistakes:
  • Using softmax, which produces a probability distribution rather than acting as a hidden-layer activation
  • Using sigmoid or tanh, which are less common in the Transformer feed-forward network
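A minimal sketch of the position-wise feed-forward network with ReLU between the two linear layers. The layer sizes are hypothetical, and torch.nn.functional.relu stands in for the bare relu name used in the exercise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embedding_dim, hidden_dim = 8, 32  # hypothetical sizes

linear1 = nn.Linear(embedding_dim, hidden_dim)   # expand
linear2 = nn.Linear(hidden_dim, embedding_dim)   # project back

x = torch.rand(2, 5, embedding_dim)  # (batch, seq_len, embedding_dim)

# The same two-layer network is applied independently at every position.
ffn_output = linear2(F.relu(linear1(x)))

print(ffn_output.shape)  # torch.Size([2, 5, 8])
```

The output has the same shape as the input, which is what allows the residual connection around this sub-layer.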
4. Fill in the blank
hard

Fill both blanks to create a positional encoding function that adds position info to token embeddings.

positional_encoding = torch.zeros(seq_len, [1])
for pos in range(seq_len):
    for i in range(0, [2], 2):
        positional_encoding[pos, i] = math.sin(pos / (10000 ** (i / [2])))
A. embedding_dim
B. seq_len
C. num_heads
D. batch_size
Common Mistakes:
  • Using the sequence length for the second blank, which should be the embedding dimension
  • Confusing the number of heads or the batch size with the embedding dimension
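The exercise fills only the even columns with sine values. The sketch below extends it with the matching cosine values in the odd columns, as in the original sinusoidal scheme. The function name is hypothetical, and embedding_dim is assumed to be even.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, embedding_dim):
    """Return a (seq_len, embedding_dim) tensor; embedding_dim is assumed even."""
    positional_encoding = torch.zeros(seq_len, embedding_dim)
    for pos in range(seq_len):
        for i in range(0, embedding_dim, 2):
            angle = pos / (10000 ** (i / embedding_dim))
            positional_encoding[pos, i] = math.sin(angle)      # even column
            positional_encoding[pos, i + 1] = math.cos(angle)  # odd column
    return positional_encoding

pe = sinusoidal_positional_encoding(seq_len=4, embedding_dim=6)

# Position information is added to the token embeddings element-wise.
token_embeddings = torch.rand(4, 6)  # placeholder embeddings
x = token_embeddings + pe

print(pe[0])  # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
```

The nested loops mirror the exercise for readability; a vectorized version would be preferable for long sequences.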
5. Fill in the blank
hard

Fill all three blanks to complete the Transformer encoder layer with normalization and residual connections.

x = x + [1](multihead_attn(x, x, x))
x = [2](x)
residual = x
x = x + [3](feed_forward(x))
A. layer_norm
B. dropout
D. relu
Common Mistakes:
  • Mixing up the order of dropout and layer normalization
  • Using activation functions instead of dropout or normalization
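The pattern in this task (dropout on the sub-layer output, residual add, then layer normalization) can be sketched as a small module. This is an illustrative post-norm layer under assumed sizes, not a reference implementation. The class name is hypothetical, and a second layer normalization after the feed-forward block is added here, which the exercise snippet omits.

```python
import torch
import torch.nn as nn

class PostNormEncoderLayer(nn.Module):
    """Hypothetical encoder layer: sub-layer -> dropout -> residual add -> layer norm."""

    def __init__(self, embed_dim=8, num_heads=2, hidden_dim=32, dropout_rate=0.1):
        super().__init__()
        self.multihead_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )
        self.layer_norm1 = nn.LayerNorm(embed_dim)
        self.layer_norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        # nn.MultiheadAttention returns (output, weights); only the output is used.
        attn_output, _ = self.multihead_attn(x, x, x)
        x = self.layer_norm1(x + self.dropout(attn_output))
        x = self.layer_norm2(x + self.dropout(self.feed_forward(x)))
        return x

layer = PostNormEncoderLayer()
layer.eval()  # disable dropout for a deterministic forward pass
out = layer(torch.rand(3, 5, 8))  # (batch, seq_len, embed_dim)
print(out.shape)  # torch.Size([3, 5, 8])
```

Note that the attention call returns a tuple in PyTorch, so the one-line form in the exercise should be read as pseudocode.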

Practice

1. What is the main purpose of the attention mechanism in a Transformer model?
easy
A. To increase the size of the model
B. To focus on important parts of the input data
C. To reduce the number of layers
D. To store data permanently

Solution

  1. Step 1: Understand attention mechanism role

    The attention mechanism helps the model decide which parts of the input are important to focus on for better understanding.
  2. Step 2: Compare options with attention purpose

    Only option B, "To focus on important parts of the input data", describes this role; the other options describe unrelated functions.
  3. Final Answer:

    To focus on important parts of the input data -> Option B
  4. Quick Check:

    Attention = Focus on important parts [OK]
Hint: Attention means focusing on key input parts [OK]
Common Mistakes:
  • Thinking attention increases model size
  • Confusing attention with data storage
  • Assuming attention reduces layers
2. Which of the following is the correct order of components inside a Transformer encoder layer?
easy
A. Multi-head attention -> Feed-forward network -> Layer normalization
B. Feed-forward network -> Multi-head attention -> Layer normalization
C. Multi-head attention -> Layer normalization -> Feed-forward network
D. Layer normalization -> Multi-head attention -> Feed-forward network

Solution

  1. Step 1: Recall Transformer encoder layer structure

    The encoder layer first computes multi-head attention output, adds residual, applies layer normalization, then computes feed-forward network, followed by another residual and layer normalization.
  2. Step 2: Match the correct sequence

    Because layer normalization follows each sub-layer, the attention output is normalized before the feed-forward network runs. The order is therefore Multi-head attention -> Layer normalization -> Feed-forward network, followed by a second residual connection and layer normalization.
  3. Final Answer:

    Multi-head attention -> Layer normalization -> Feed-forward network -> Option C
  4. Quick Check:

    Encoder order = Attn -> Norm -> FFN -> Norm [OK]
Hint: Encoder: attn -> norm -> feed-forward -> norm [OK]
Common Mistakes:
  • Mixing up the order of feed-forward and attention
  • Placing layer normalization incorrectly
  • Assuming normalization comes first
3. Given a Transformer decoder layer with masked multi-head attention, what is the main reason for masking?
medium
A. To prevent the model from seeing future tokens during training
B. To speed up the training process
C. To increase the number of attention heads
D. To reduce the model size

Solution

  1. Step 1: Understand masking in decoder attention

    Masking hides future tokens so the model predicts the next word without cheating by looking ahead.
  2. Step 2: Evaluate options against masking purpose

    Only option A, "To prevent the model from seeing future tokens during training", explains why masking is used; the other options describe effects masking does not have.
  3. Final Answer:

    To prevent the model from seeing future tokens during training -> Option A
  4. Quick Check:

    Masking = Hide future tokens [OK]
Hint: Masking hides future words in decoder [OK]
Common Mistakes:
  • Thinking masking speeds training
  • Confusing masking with model size reduction
  • Assuming masking adds attention heads
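A small sketch of how a causal mask hides future tokens. The scores are placeholder zeros so that the effect of the mask is easy to read; in a real decoder they would come from query-key dot products.

```python
import torch

seq_len = 4

# True above the diagonal: position i may not attend to any position j > i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

scores = torch.zeros(seq_len, seq_len)  # placeholder attention scores
masked_scores = scores.masked_fill(causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights)
# Row 0 attends only to position 0; row 3 attends equally to positions 0..3.
```

Setting a score to negative infinity makes its softmax weight exactly zero, so future positions contribute nothing to the output.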
4. Consider this simplified Transformer encoder code snippet in Python:
import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim=8, num_heads=2)
    def forward(self, x):
        attn_output, _ = self.attention(x, x, x)
        return attn_output

x = torch.rand(5, 3, 8)  # sequence length=5, batch=3, embed=8
model = SimpleEncoder()
output = model(x)
print(output.shape)
What, if anything, is wrong with the input shape in this code?
medium
A. MultiheadAttention requires input shape (sequence length, batch size, embedding), but x is (batch size, sequence length, embedding)
B. Input shape to MultiheadAttention should be (batch size, sequence length, embedding), but x is (5, 3, 8)
C. MultiheadAttention requires input shape (sequence length, batch size, embedding), but x is (5, 3, 8)
D. No error: MultiheadAttention expects (sequence length, batch size, embedding) by default, and x with shape (5, 3, 8) matches it

Solution

  1. Step 1: Check expected input shape for nn.MultiheadAttention

    With the default batch_first=False, PyTorch's nn.MultiheadAttention expects input shape (sequence length, batch size, embedding dimension).
  2. Step 2: Verify input tensor shape

    The tensor x has shape (5, 3, 8), which matches (sequence length=5, batch size=3, embedding=8), so the code runs without error and prints torch.Size([5, 3, 8]).
  3. Final Answer:

    No error: MultiheadAttention expects (sequence length, batch size, embedding) by default, and x with shape (5, 3, 8) matches it -> Option D
  4. Quick Check:

    Input shape = (seq_len, batch, embed) [OK]
Hint: MultiheadAttention input shape is (seq_len, batch, embed) [OK]
Common Mistakes:
  • Confusing batch and sequence length order
  • Assuming batch size is first dimension
  • Mixing embedding dimension position
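The shape convention depends on the batch_first argument of nn.MultiheadAttention. The sketch below runs the same data through both layouts; variable names other than x are illustrative.

```python
import torch
import torch.nn as nn

x = torch.rand(5, 3, 8)  # (seq_len=5, batch=3, embed=8)

# Default layout (batch_first=False): sequence dimension comes first.
attention = nn.MultiheadAttention(embed_dim=8, num_heads=2)
attn_output, attn_weights = attention(x, x, x)
print(attn_output.shape)   # torch.Size([5, 3, 8])
print(attn_weights.shape)  # torch.Size([3, 5, 5]), averaged over heads

# Batch-first layout: swap the first two dimensions of the input.
attention_bf = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x_bf = x.transpose(0, 1)   # (batch=3, seq_len=5, embed=8)
out_bf, _ = attention_bf(x_bf, x_bf, x_bf)
print(out_bf.shape)        # torch.Size([3, 5, 8])
```

Passing a batch-first tensor to a default module does not raise an error as long as the last dimension matches embed_dim, so the mistake shows up as silently wrong attention rather than a crash.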
5. You want to build a Transformer model for translating short sentences. Which combination of components is essential in the Transformer architecture to handle this task effectively?
hard
A. Feed-forward networks only without attention
B. Only encoder layers with feed-forward networks
C. Encoder with self-attention, decoder with masked self-attention, and cross-attention layers
D. Decoder layers without attention mechanisms

Solution

  1. Step 1: Identify components needed for translation

    Translation requires understanding input (encoder) and generating output (decoder) with attention to both input and generated tokens.
  2. Step 2: Match components to translation needs

    Option C covers all three pieces: encoder self-attention to understand the source sentence, masked self-attention in the decoder to prevent peeking at future tokens, and cross-attention to let the decoder read the encoder's output.
  3. Final Answer:

    Encoder with self-attention, decoder with masked self-attention, and cross-attention layers -> Option C
  4. Quick Check:

    Translation needs encoder, decoder, and cross-attention [OK]
Hint: Translation needs encoder, decoder, and cross-attention [OK]
Common Mistakes:
  • Ignoring decoder or cross-attention layers
  • Using only feed-forward networks
  • Skipping masking in decoder
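PyTorch's nn.Transformer bundles these three attention types, so a rough sketch of the full encoder-decoder forward pass is short. The sizes below are hypothetical, and random tensors stand in for embedded source and target sentences.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny encoder-decoder model; all sizes are illustrative.
model = nn.Transformer(
    d_model=16, nhead=2,
    num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=32, batch_first=True,
)
model.eval()  # disable dropout

src = torch.rand(2, 6, 16)  # embedded source sentences: (batch, src_len, d_model)
tgt = torch.rand(2, 4, 16)  # embedded target prefixes:  (batch, tgt_len, d_model)

# Causal mask for the decoder: -inf above the diagonal hides future target tokens.
tgt_mask = torch.triu(torch.full((4, 4), float("-inf")), diagonal=1)

with torch.no_grad():
    out = model(src, tgt, tgt_mask=tgt_mask)

print(out.shape)  # torch.Size([2, 4, 16])
```

A real translation model would also need token embeddings, positional encodings, and a final linear layer over the target vocabulary, none of which are shown here.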