
Transformer architecture overview in Prompt Engineering / GenAI - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the main purpose of the Transformer architecture in AI?
The Transformer architecture is designed to process sequences of data, like sentences, by focusing on relationships between all parts of the sequence at once, enabling better understanding and generation of language.
beginner
What does 'self-attention' mean in the Transformer model?
Self-attention is a mechanism in which, for each word, the model weighs every word in the sentence by how relevant it is to that word, so each word's representation captures its context.
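The weighting described above can be sketched in plain Python. This is a minimal illustration, not a full implementation: it assumes queries, keys, and values are all the raw word vectors (a real Transformer first applies learned projections), and the three 2-dimensional "word" vectors are made-up numbers.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors.

    Assumption: the learned Q, K, V projections are omitted to keep the sketch short.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word in the sentence (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The new representation is a weighted average of all word vectors.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional "word" vectors (made-up numbers).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(words)
print(weights[0])  # attention weights the first word assigns to each word
```

Each row of weights sums to 1, and the first word gives more weight to the words whose vectors point in a similar direction.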
beginner
Name the two main parts of the Transformer architecture.
The Transformer has two main parts: the Encoder, which reads and understands the input data, and the Decoder, which generates the output based on the Encoder's understanding.
intermediate
Why does the Transformer use 'positional encoding'?
Because Transformers process all words at once, positional encoding adds information about the order of words so the model knows the sequence in which words appear.
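One common way to add order information is the sinusoidal scheme from the original Transformer design. The sketch below assumes that scheme (learned position embeddings are another option), and the function name and sizes are chosen only for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: even dimensions use sine, odd dimensions use cosine."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=6)
# Every position gets a distinct vector, which is added to the word embedding.
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Because each position maps to a different vector, two identical words at different positions end up with different inputs to the model.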
intermediate
How does the Transformer differ from older sequence models like RNNs?
Unlike RNNs, which process words one by one, Transformers look at all words simultaneously using self-attention. This lets training run in parallel across the sequence and lets the model relate distant words directly, which helps with long-range relationships.
What is the role of the Encoder in a Transformer?
A. To generate the output sequence
B. To read and understand the input data
C. To add positional information
D. To perform self-attention only on output
What does self-attention help the Transformer model do?
A. Ignore word order
B. Process words one at a time
C. Focus on important parts of the input sequence
D. Reduce the size of the input
Why is positional encoding necessary in Transformers?
A. To increase vocabulary size
B. To speed up training
C. To reduce model size
D. Because Transformers do not process data sequentially
Which part of the Transformer generates the final output?
A. Decoder
B. Encoder
C. Self-attention layer
D. Positional encoding
How does the Transformer improve over RNNs?
A. Processes sequences in parallel using self-attention
B. Processes sequences strictly one word at a time
C. Uses fewer layers
D. Ignores context
Explain how self-attention works in the Transformer architecture and why it is important.
Think about how the model decides which words to focus on when reading a sentence.
Describe the roles of the Encoder and Decoder in the Transformer model.
Consider the flow from input to output in a translation task.

      Practice

      1. What is the main purpose of the attention mechanism in a Transformer model?
      easy
      A. To increase the size of the model
      B. To focus on important parts of the input data
      C. To reduce the number of layers
      D. To store data permanently

      Solution

      1. Step 1: Understand attention mechanism role

        The attention mechanism helps the model decide which parts of the input are important to focus on for better understanding.
      2. Step 2: Compare options with attention purpose

        Only option B, "To focus on important parts of the input data", describes this role; the other options describe unrelated functions.
      3. Final Answer:

        To focus on important parts of the input data -> Option B
      4. Quick Check:

        Attention = Focus on important parts [OK]
      Hint: Attention means focusing on key input parts [OK]
      Common Mistakes:
      • Thinking attention increases model size
      • Confusing attention with data storage
      • Assuming attention reduces layers
      2. Which of the following is the correct order of components inside a Transformer encoder layer?
      easy
      A. Multi-head attention -> Feed-forward network -> Layer normalization
      B. Feed-forward network -> Multi-head attention -> Layer normalization
      C. Multi-head attention -> Layer normalization -> Feed-forward network
      D. Layer normalization -> Multi-head attention -> Feed-forward network

      Solution

      1. Step 1: Recall Transformer encoder layer structure

        In the original design, the encoder layer computes multi-head attention, adds the residual and applies layer normalization, then runs the feed-forward network, followed by another residual add and layer normalization.
      2. Step 2: Match the correct sequence

        The two sub-layers run in the order multi-head attention, then feed-forward network, and the layer ends with layer normalization. Option A is the only choice that keeps both facts; none of the three-item options shows the extra normalization that also follows the attention sub-layer.
      3. Final Answer:

        Multi-head attention -> Feed-forward network -> Layer normalization -> Option A
      4. Quick Check:

        Encoder order = Attn -> FFN -> Norm [OK]
      Hint: Encoder: attn -> feed-forward -> norm [OK]
      Common Mistakes:
      • Mixing up the order of feed-forward and attention
      • Placing layer normalization incorrectly
      • Assuming normalization comes first
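The layer structure from Step 1 can be sketched in PyTorch. This is a minimal sketch assuming the post-norm arrangement described there (normalization after each residual add); the class name and layer sizes are hypothetical, and dropout is left out for brevity.

```python
import torch
import torch.nn as nn

class PostNormEncoderLayer(nn.Module):
    """One encoder layer in the post-norm arrangement (hypothetical class name)."""

    def __init__(self, embed_dim=8, num_heads=2, ff_dim=32):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Sub-layer 1: self-attention, residual add, layer normalization.
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_output)
        # Sub-layer 2: feed-forward network, residual add, layer normalization.
        x = self.norm2(x + self.feed_forward(x))
        return x

x = torch.rand(5, 3, 8)  # (sequence length, batch size, embedding)
layer = PostNormEncoderLayer()
print(layer(x).shape)  # same shape as the input
```

Some later models move the normalization to the start of each sub-layer (pre-norm); the order of the two sub-layers stays the same.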
      3. Given a Transformer decoder layer with masked multi-head attention, what is the main reason for masking?
      medium
      A. To prevent the model from seeing future tokens during training
      B. To speed up the training process
      C. To increase the number of attention heads
      D. To reduce the model size

      Solution

      1. Step 1: Understand masking in decoder attention

        Masking hides future tokens so the model predicts the next word without cheating by looking ahead.
      2. Step 2: Evaluate options against masking purpose

        Only option A, "To prevent the model from seeing future tokens during training", explains the purpose of masking; the other options describe effects that masking does not have.
      3. Final Answer:

        To prevent the model from seeing future tokens during training -> Option A
      4. Quick Check:

        Masking = Hide future tokens [OK]
      Hint: Masking hides future words in decoder [OK]
      Common Mistakes:
      • Thinking masking speeds training
      • Confusing masking with model size reduction
      • Assuming masking adds attention heads
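The effect of the mask can be shown with a small plain-Python sketch. It assumes a square matrix of attention scores and the usual trick of setting future positions to negative infinity before the softmax; the function name and the uniform scores are made up for illustration.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row.

    scores[i][j] is how strongly position i wants to attend to position j.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions j > i are in the future: set their score to -infinity.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform made-up scores for a 4-token sequence.
scores = [[0.0] * 4 for _ in range(4)]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

The first row puts all of its weight on the first token, while the last row spreads its weight over all four tokens, so no position ever draws on tokens that come after it.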
      4. Consider this simplified Transformer encoder code snippet in Python:
      import torch
      import torch.nn as nn
      
      class SimpleEncoder(nn.Module):
          def __init__(self):
              super().__init__()
              self.attention = nn.MultiheadAttention(embed_dim=8, num_heads=2)
          def forward(self, x):
              attn_output, _ = self.attention(x, x, x)
              return attn_output
      
      x = torch.rand(5, 3, 8)  # sequence length=5, batch=3, embed=8
      model = SimpleEncoder()
      output = model(x)
      print(output.shape)
      Which statement about the input shape in this code is correct?
      medium
      A. MultiheadAttention expects (sequence length, batch size, embedding), but x is laid out as (batch size, sequence length, embedding)
      B. MultiheadAttention expects (batch size, sequence length, embedding), but x is laid out as (sequence length, batch size, embedding)
      C. MultiheadAttention expects (sequence length, batch size, embedding), and x with shape (5, 3, 8) does not match it
      D. MultiheadAttention expects (sequence length, batch size, embedding), and x with shape (5, 3, 8) matches it, so the code runs without error

      Solution

      1. Step 1: Check expected input shape for nn.MultiheadAttention

        With its default setting (batch_first=False), PyTorch's MultiheadAttention expects input shape (sequence length, batch size, embedding dimension).
      2. Step 2: Verify input tensor shape

        The tensor x has shape (5, 3, 8), which matches (sequence length=5, batch size=3, embedding=8), and embed_dim=8 is divisible by num_heads=2, so the code is correct and prints torch.Size([5, 3, 8]).
      3. Final Answer:

        MultiheadAttention expects (sequence length, batch size, embedding), and x with shape (5, 3, 8) matches it, so the code runs without error -> Option D
      4. Quick Check:

        Input shape = (seq_len, batch, embed) [OK]
      Hint: MultiheadAttention input shape is (seq_len, batch, embed) [OK]
      Common Mistakes:
      • Confusing batch and sequence length order
      • Assuming batch size is first dimension
      • Mixing embedding dimension position
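If your data is laid out batch-first, nn.MultiheadAttention accepts a batch_first=True argument instead of requiring a transpose. The sketch below contrasts the two layouts; the variable names are illustrative and the tensors are random.

```python
import torch
import torch.nn as nn

x_seq_first = torch.rand(5, 3, 8)            # (sequence length, batch size, embedding)
x_batch_first = x_seq_first.transpose(0, 1)  # (batch size, sequence length, embedding)

# Default layout: sequence length comes first.
attn_default = nn.MultiheadAttention(embed_dim=8, num_heads=2)
out_default, _ = attn_default(x_seq_first, x_seq_first, x_seq_first)

# batch_first=True switches the expected layout to batch-first.
attn_batch_first = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
out_batch_first, _ = attn_batch_first(x_batch_first, x_batch_first, x_batch_first)

print(out_default.shape)      # follows the sequence-first input layout
print(out_batch_first.shape)  # follows the batch-first input layout
```

In both cases the output keeps the layout of the input, so the main risk is silently passing a batch-first tensor to a module built with the default setting.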
      5. You want to build a Transformer model for translating short sentences. Which combination of components is essential in the Transformer architecture to handle this task effectively?
      hard
      A. Feed-forward networks only without attention
      B. Only encoder layers with feed-forward networks
      C. Encoder with self-attention, decoder with masked self-attention, and cross-attention layers
      D. Decoder layers without attention mechanisms

      Solution

      1. Step 1: Identify components needed for translation

        Translation requires understanding input (encoder) and generating output (decoder) with attention to both input and generated tokens.
      2. Step 2: Match components to translation needs

        Option C combines all three pieces: encoder self-attention to understand the input, masked self-attention in the decoder to prevent peeking at future tokens, and cross-attention so the decoder can consult the encoder's output. All three are essential for translation.
      3. Final Answer:

        Encoder with self-attention, decoder with masked self-attention, and cross-attention layers -> Option C
      4. Quick Check:

        Translation needs encoder, decoder, and cross-attention [OK]
      Hint: Translation needs encoder, decoder, and cross-attention [OK]
      Common Mistakes:
      • Ignoring decoder or cross-attention layers
      • Using only feed-forward networks
      • Skipping masking in decoder
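PyTorch's nn.Transformer bundles these three pieces (encoder self-attention, masked decoder self-attention, and cross-attention). The following is a rough sketch, not a working translator: the random tensors stand in for embedded source and target tokens, and the small layer sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# A small encoder-decoder Transformer (sizes chosen only for illustration).
model = nn.Transformer(
    d_model=16,
    nhead=2,
    num_encoder_layers=1,
    num_decoder_layers=1,
    dim_feedforward=32,
)

# Random stand-ins for embedded tokens, in (sequence length, batch size, d_model) layout.
src = torch.rand(6, 2, 16)  # source sentence: 6 tokens, batch of 2
tgt = torch.rand(4, 2, 16)  # target tokens generated so far: 4 tokens

# Causal mask so each target position only attends to earlier target positions.
tgt_mask = model.generate_square_subsequent_mask(4)

# The encoder reads src; the decoder attends to tgt (masked) and to the encoder output.
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # one output vector per target position
```

A real translation model would add token embeddings, positional encoding, and a final linear layer that maps each output vector to vocabulary scores.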