Recall & Review
beginner
What is the main purpose of the Transformer architecture in AI?
The Transformer architecture is designed to process sequences of data, like sentences, by focusing on relationships between all parts of the sequence at once, enabling better understanding and generation of language.
beginner
What does 'self-attention' mean in the Transformer model?
Self-attention is a mechanism where the model looks at all words in a sentence to decide which words are important to understand each word better, helping it capture context effectively.
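The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not a full implementation: it assumes each word is already a small vector and uses those vectors directly as queries, keys, and values, whereas a real Transformer first applies learned projections. The function name `self_attention` and the toy vectors are made up for this example.

```python
import math


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention without learned projections.

    x is a list of token vectors; queries, keys, and values are all x itself.
    Returns (outputs, weights), where weights[i][j] is how much token i
    attends to token j.
    """
    dim = len(x[0])
    # Similarity of every token with every other token, scaled by sqrt(dim)
    scores = [
        [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        for query in x
    ]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of all token vectors
    outputs = [
        [sum(w * value[j] for w, value in zip(row, x)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Three toy "word" vectors: the first two are similar, the third is different.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one word attends to every word in the sequence. The first word gives more weight to the similar second word than to the dissimilar third one.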
beginner
Name the two main parts of the Transformer architecture.
The Transformer has two main parts: the Encoder, which reads and understands the input data, and the Decoder, which generates the output based on the Encoder's understanding.
intermediate
Why does the Transformer use 'positional encoding'?
Because Transformers process all words at once, positional encoding adds information about the order of words so the model knows the sequence in which words appear.
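One common way to build this order information is the fixed sinusoidal scheme from the original Transformer paper; learned position embeddings are another option. The sketch below assumes the sinusoidal variant, and the function name `positional_encoding` is illustrative.

```python
import math


def positional_encoding(seq_len, dim):
    """Return a seq_len x dim table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k / dim)
            angle = pos / (10000 ** ((i - i % 2) / dim))
            # Even dimensions use sine, odd dimensions use cosine
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(seq_len=4, dim=6)
# Each position gets a distinct vector, which is added to that word's embedding.
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Because every position maps to a different vector, two identical words at different positions end up with different inputs to the model.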
intermediate
How does the Transformer differ from older sequence models like RNNs?
Unlike RNNs that process words one by one, Transformers look at all words simultaneously using self-attention, which allows faster training and better understanding of long-range relationships.
What is the role of the Encoder in a Transformer?
A. To generate the output sequence
B. To read and understand the input data
C. To add positional information
D. To perform self-attention only on output
The Encoder reads and processes the input data to create a representation that the Decoder can use.
What does self-attention help the Transformer model do?
A. Ignore word order
B. Process words one at a time
C. Focus on important parts of the input sequence
D. Reduce the size of the input
Self-attention helps the model focus on relevant words in the sequence to understand context better.
Why is positional encoding necessary in Transformers?
A. To increase vocabulary size
B. To speed up training
C. To reduce model size
D. Because Transformers do not process data sequentially
Positional encoding tells the model the order of words since Transformers look at all words at once.
Which part of the Transformer generates the final output?
A. Decoder
B. Encoder
C. Self-attention layer
D. Positional encoding
The Decoder uses the Encoder's information to produce the output sequence.
How does the Transformer improve over RNNs?
A. Processes sequences in parallel using self-attention
B. Processes sequences strictly one word at a time
C. Uses fewer layers
D. Ignores context
Transformers process all words simultaneously, making training faster and capturing long-range dependencies better.
Explain how self-attention works in the Transformer architecture and why it is important.
Think about how the model decides which words to focus on when reading a sentence.
Describe the roles of the Encoder and Decoder in the Transformer model.
Consider the flow from input to output in a translation task.
Practice
1. What is the main purpose of the attention mechanism in a Transformer model?
easy
A. To increase the size of the model
B. To focus on important parts of the input data
C. To reduce the number of layers
D. To store data permanently
Solution
Step 1: Understand attention mechanism role
The attention mechanism helps the model decide which parts of the input are important to focus on for better understanding.
Step 2: Compare options with attention purpose
Only option B, To focus on important parts of the input data, describes this role; the other options describe unrelated functions.
Final Answer:
To focus on important parts of the input data -> Option B
Quick Check:
Attention = Focus on important parts [OK]
Hint: Attention means focusing on key input parts [OK]
Common Mistakes:
Thinking attention increases model size
Confusing attention with data storage
Assuming attention reduces layers
2. Which of the following is the correct order of components inside a Transformer encoder layer?
easy
A. Multi-head attention -> Feed-forward network -> Layer normalization
B. Feed-forward network -> Multi-head attention -> Layer normalization
C. Multi-head attention -> Layer normalization -> Feed-forward network
D. Layer normalization -> Multi-head attention -> Feed-forward network
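Solution
Step 1: Recall the encoder layer structure
In the original (post-norm) design, the encoder layer first computes the multi-head attention output, adds the residual connection, and applies layer normalization. It then runs the feed-forward network, followed by another residual connection and layer normalization.
Step 2: Match the correct sequence
The two sub-layers run in the order multi-head attention, then feed-forward network, and each is followed by a residual connection and layer normalization. Option A, Multi-head attention -> Feed-forward network -> Layer normalization, matches this order if layer normalization is read as applying after each sub-layer.
Final Answer:
Multi-head attention -> Feed-forward network -> Layer normalization -> Option A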
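The sub-layer order described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the class name `PostNormEncoderLayer` and the sizes (`embed_dim=8`, `num_heads=2`, `ff_dim=16`) are hypothetical, dropout is omitted, and the layer follows the post-norm arrangement in which normalization comes after each residual connection.

```python
import torch
import torch.nn as nn


class PostNormEncoderLayer(nn.Module):
    """Hypothetical minimal layer: attention -> add & norm -> FFN -> add & norm."""

    def __init__(self, embed_dim=8, num_heads=2, ff_dim=16):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Sub-layer 1: self-attention, then residual connection and layer norm
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_output)
        # Sub-layer 2: feed-forward network, then residual connection and layer norm
        x = self.norm2(x + self.ffn(x))
        return x


x = torch.rand(5, 3, 8)  # (sequence length, batch, embedding)
layer = PostNormEncoderLayer()
output = layer(x)
print(output.shape)  # torch.Size([5, 3, 8])
```

The layer keeps the input shape unchanged, which is what allows several such layers to be stacked.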
3. Given a Transformer decoder layer with masked multi-head attention, what is the main reason for masking?
medium
A. To prevent the model from seeing future tokens during training
B. To speed up the training process
C. To increase the number of attention heads
D. To reduce the model size
Solution
Step 1: Understand masking in decoder attention
Masking hides future tokens so the model predicts the next word without cheating by looking ahead.
Step 2: Evaluate options against masking purpose
Only option A, To prevent the model from seeing future tokens during training, explains the role of masking; the other options describe unrelated effects.
Final Answer:
To prevent the model from seeing future tokens during training -> Option A
Quick Check:
Masking = Hide future tokens [OK]
Hint: Masking hides future words in decoder [OK]
Common Mistakes:
Thinking masking speeds training
Confusing masking with model size reduction
Assuming masking adds attention heads
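The effect of masking can be seen on a small matrix of attention scores. The sketch below is a plain-Python illustration with a made-up function name, `causal_softmax`; it assumes the usual approach of setting scores for future positions to negative infinity before the softmax.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) get -inf, so they receive zero weight
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)  # finite, because position i itself is never masked
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Equal raw scores make the masking pattern easy to see.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

The first position can attend only to itself, the second to the first two positions, and so on, so no position receives any weight from a token that comes after it.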
4. Consider this simplified Transformer encoder code snippet in Python:
import torch
import torch.nn as nn


class SimpleEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Self-attention over 8-dimensional embeddings, split across 2 heads
        self.attention = nn.MultiheadAttention(embed_dim=8, num_heads=2)

    def forward(self, x):
        # Query, key, and value are all x, which makes this self-attention
        attn_output, _ = self.attention(x, x, x)
        return attn_output


x = torch.rand(5, 3, 8)  # sequence length=5, batch=3, embed=8
model = SimpleEncoder()
output = model(x)
print(output.shape)
What, if anything, is wrong with the input shape in this code?
medium
A. MultiheadAttention requires input shape (sequence length, batch size, embedding), but x is (batch size, sequence length, embedding)
B. Input shape to MultiheadAttention should be (batch size, sequence length, embedding), but x is (5, 3, 8)
C. MultiheadAttention requires input shape (sequence length, batch size, embedding), but x is (5, 3, 8)
D. Input shape to MultiheadAttention should be (sequence length, batch size, embedding), but x is (5, 3, 8) which is correct
Solution
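Step 1: Check expected input shape for nn.MultiheadAttention
By default (batch_first=False), nn.MultiheadAttention expects input of shape (sequence length, batch size, embedding). The tensor x has shape (5, 3, 8), which matches (sequence length=5, batch size=3, embedding=8), so the input is correct and the code runs without a shape error.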
Final Answer:
Input shape to MultiheadAttention should be (sequence length, batch size, embedding), but x is (5, 3, 8) which is correct -> Option D
Quick Check:
Input shape = (seq_len, batch, embed) [OK]
Hint: By default, MultiheadAttention input shape is (seq_len, batch, embed) [OK]
Common Mistakes:
Confusing batch and sequence length order
Assuming batch size is first dimension
Mixing embedding dimension position
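The two input layouts can be compared side by side. The sketch below assumes a PyTorch version in which nn.MultiheadAttention accepts the batch_first argument; the tensor sizes reuse those from the question.

```python
import torch
import torch.nn as nn

# Default layout: (sequence length, batch size, embedding)
seq_first = nn.MultiheadAttention(embed_dim=8, num_heads=2)
x = torch.rand(5, 3, 8)
out, attn_weights = seq_first(x, x, x)
print(out.shape)           # torch.Size([5, 3, 8])
print(attn_weights.shape)  # torch.Size([3, 5, 5]): (batch, target length, source length)

# With batch_first=True the layout is (batch size, sequence length, embedding)
batch_first = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
y = torch.rand(3, 5, 8)
out_bf, _ = batch_first(y, y, y)
print(out_bf.shape)  # torch.Size([3, 5, 8])
```

Passing a batch-first tensor to the default module does not raise an error by itself, because both layouts are three-dimensional; the module would silently treat the batch dimension as the sequence. That is why the layout is worth checking explicitly.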
5. You want to build a Transformer model for translating short sentences. Which combination of components is essential in the Transformer architecture to handle this task effectively?
hard
A. Feed-forward networks only without attention
B. Only encoder layers with feed-forward networks
C. Encoder with self-attention, decoder with masked self-attention, and cross-attention layers
D. Decoder layers without attention mechanisms
Solution
Step 1: Identify components needed for translation
Translation requires understanding input (encoder) and generating output (decoder) with attention to both input and generated tokens.
Step 2: Match components to translation needs
Option C covers all three pieces: encoder self-attention to understand the input, decoder masked self-attention to prevent peeking at future tokens, and cross-attention to let the decoder attend to the encoder's output. Translation needs all three.
Final Answer:
Encoder with self-attention, decoder with masked self-attention, and cross-attention layers -> Option C
Quick Check:
Translation needs encoder, decoder, and cross-attention [OK]
Hint: Translation needs encoder, decoder, and cross-attention [OK]
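As a rough sketch of how these pieces fit together, PyTorch's nn.Transformer bundles encoder self-attention, decoder masked self-attention, and cross-attention in one module. The example below makes several simplifying assumptions: the sizes are arbitrary, random tensors stand in for embedded source and target sentences, and the token embeddings, positional encoding, and output projection that a real translation model needs are left out.

```python
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=16,
    nhead=2,
    num_encoder_layers=1,
    num_decoder_layers=1,
    dim_feedforward=32,
)

src = torch.rand(6, 2, 16)  # (source length, batch, d_model)
tgt = torch.rand(4, 2, 16)  # (target length, batch, d_model)

# Causal mask for the decoder: -inf above the diagonal hides future target tokens
tgt_mask = torch.triu(torch.full((4, 4), float("-inf")), diagonal=1)

# The encoder reads src; the decoder attends to its own past tokens (masked)
# and to the encoder output (cross-attention).
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([4, 2, 16])
```

The output has one vector per target position, which a full model would project onto the target vocabulary to predict the next word.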