Transformer architecture overview in Prompt Engineering / GenAI - Model Pipeline Trace

Model Pipeline - Transformer architecture overview

A Transformer first converts words into token IDs and embedding vectors, then uses attention to model the relationships between those words. During training it adjusts its weights to reduce prediction error; at inference it predicts outputs such as translated sentences or answers.

Data Flow - 6 Stages

  1. Input tokens
    Input shape: 1 sentence x 10 words
    Operation: Convert words to token IDs using vocabulary
    Output shape: 1 sentence x 10 tokens
    Example: ["I", "love", "cats"] -> [101, 2023, 1234]
  2. Embedding layer
    Input shape: 1 sentence x 10 tokens
    Operation: Map tokens to vectors of size 512
    Output shape: 1 sentence x 10 tokens x 512 features
    Example: [101, 2023, 1234] -> [[0.1, 0.3, ...], [0.5, 0.2, ...], [0.4, 0.7, ...]]
  3. Positional encoding
    Input shape: 1 sentence x 10 tokens x 512 features
    Operation: Add position info to embeddings
    Output shape: 1 sentence x 10 tokens x 512 features
    Example: Embedding vector + position vector for each token
  4. Multi-head self-attention
    Input shape: 1 sentence x 10 tokens x 512 features
    Operation: Calculate attention scores and weighted sums
    Output shape: 1 sentence x 10 tokens x 512 features
    Example: Each token attends to others to gather context
  5. Feed-forward network
    Input shape: 1 sentence x 10 tokens x 512 features
    Operation: Apply two linear layers with ReLU in between
    Output shape: 1 sentence x 10 tokens x 512 features
    Example: Transform features to capture complex patterns
  6. Output layer
    Input shape: 1 sentence x 10 tokens x 512 features
    Operation: Project to vocabulary size and apply softmax
    Output shape: 1 sentence x 10 tokens x 10000 classes
    Example: Predict probability for each word in vocabulary
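
The six stages can be traced end to end in a few lines of NumPy. This is only a sketch under stated assumptions: the sizes are shrunk to toy values, the vocabulary and all weights are hypothetical random placeholders, the positional encoding is assumed to be the sinusoidal variant (the trace above only says "add position info"), attention uses a single head, and residual connections and layer normalization are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumption); the trace above uses 10 tokens, 512 features, 10000 classes
seq_len, d_model, d_ff, vocab_size = 3, 8, 16, 20

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stage 1: hypothetical vocabulary mapping words to token IDs
vocab = {"I": 1, "love": 2, "cats": 3}
token_ids = np.array([vocab[w] for w in ["I", "love", "cats"]])

# Stage 2: embedding lookup, one row of the table per token ID
embedding_table = rng.normal(size=(vocab_size, d_model))
x = embedding_table[token_ids]                        # (seq_len, d_model)

# Stage 3: sinusoidal positional encoding (assumed variant) added to the embeddings
pos = np.arange(seq_len)[:, None]
even_dims = np.arange(0, d_model, 2)[None, :]
angles = pos / (10000 ** (even_dims / d_model))
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)
x = x + pe

# Stage 4: single-head self-attention (multi-head splitting omitted)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(d_model)) @ v        # (seq_len, d_model)

# Stage 5: feed-forward network, two linear layers with ReLU in between
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
ffn = np.maximum(attn @ W1, 0) @ W2                   # (seq_len, d_model)

# Stage 6: project to vocabulary size and apply softmax
W_out = rng.normal(size=(d_model, vocab_size))
probs = softmax(ffn @ W_out)                          # (seq_len, vocab_size)

print(x.shape, attn.shape, ffn.shape, probs.shape)
```

The feature dimension stays fixed from stage 2 through stage 5 and only changes at the output projection, which mirrors the shapes listed in the trace.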
Training Trace - Epoch by Epoch

Epoch 1: *****
Epoch 2: ****
Epoch 3: ***
Epoch 4: **
Epoch 5: *
Epoch 6: *
(Loss decreasing over epochs)
Epoch | Loss ↓ | Accuracy ↑ | Observation
1 | 5.2 | 0.12 | High loss and low accuracy at start
2 | 3.8 | 0.35 | Loss decreased, accuracy improved
3 | 2.7 | 0.52 | Model learning meaningful patterns
4 | 1.9 | 0.68 | Good progress, loss dropping steadily
5 | 1.3 | 0.78 | Model converging with better accuracy
6 | 0.9 | 0.85 | Loss low, accuracy high, training stable
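
The trace does not say which loss is reported. Assuming it is the usual cross-entropy over the softmax output, the sketch below shows how a single loss value relates to the probability the model assigns to the correct token. The function name and the example probabilities are illustrative, not taken from the source.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[target_index])

vocab_size = 10000  # matches the output layer in the data flow above

# A model that guesses uniformly over the vocabulary
uniform = [1.0 / vocab_size] * vocab_size
baseline = cross_entropy(uniform, target_index=0)
print(round(baseline, 2))   # ln(10000), about 9.21

# A model that puts probability 0.4 on the correct token (hypothetical value)
confident = cross_entropy([0.4, 0.6], target_index=0)
print(round(confident, 2))  # about 0.92
```

Under this assumption, a loss falling toward 0.9 means the model is assigning roughly 40% probability to the correct token on average, far above the uniform baseline.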
Prediction Trace - 5 Layers
Layer 1: Tokenization
Layer 2: Embedding + Positional Encoding
Layer 3: Multi-head Self-Attention
Layer 4: Feed-forward Network
Layer 5: Output Projection + Softmax
Model Quiz - 3 Questions
Test your understanding
What does the embedding layer do in the Transformer?
A. Splits sentences into words
B. Converts tokens into vectors with meaning
C. Calculates attention scores
D. Applies softmax to output
Key Insight
The Transformer uses attention to understand relationships between words, allowing it to learn context effectively. Training reduces errors steadily, improving prediction accuracy.

Practice

(1/5)
1. What is the main purpose of the attention mechanism in a Transformer model?
easy
A. To increase the size of the model
B. To focus on important parts of the input data
C. To reduce the number of layers
D. To store data permanently

Solution

  1. Step 1: Understand attention mechanism role

    The attention mechanism helps the model decide which parts of the input are important to focus on for better understanding.
  2. Step 2: Compare options with attention purpose

    Only "To focus on important parts of the input data" describes this role; the other options describe unrelated functions.
  3. Final Answer:

    To focus on important parts of the input data -> Option B
  4. Quick Check:

    Attention = Focus on important parts [OK]
Hint: Attention means focusing on key input parts [OK]
Common Mistakes:
  • Thinking attention increases model size
  • Confusing attention with data storage
  • Assuming attention reduces layers
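
The "focus" described in the solution can be made concrete with scaled dot-product attention. The sketch below assumes a single head with no learned projections, and the toy vectors are hypothetical; it only shows how similarity scores become weights that sum to 1.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for queries q, keys k, and values v."""
    scores = q @ k.T / np.sqrt(k.shape[-1])   # similarity of each query to each key
    weights = softmax(scores)                 # each row sums to 1
    return weights @ v, weights

# Three toy token vectors (hypothetical values); tokens 0 and 2 are identical
x = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])
output, weights = scaled_dot_product_attention(x, x, x)
print(weights.round(2))
```

Token 0 gives more weight to itself and to the identical token 2 than to the dissimilar token 1, which is the sense in which attention "focuses" on relevant input.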
2. Which of the following is the correct order of components inside a Transformer encoder layer?
easy
A. Multi-head attention -> Feed-forward network -> Layer normalization
B. Feed-forward network -> Multi-head attention -> Layer normalization
C. Multi-head attention -> Layer normalization -> Feed-forward network
D. Layer normalization -> Multi-head attention -> Feed-forward network

Solution

  1. Step 1: Recall Transformer encoder layer structure

    The encoder layer first computes multi-head attention output, adds residual, applies layer normalization, then computes feed-forward network, followed by another residual and layer normalization.
  2. Step 2: Match the correct sequence

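    Each sub-layer is followed by a residual connection and layer normalization, so the full sequence is Multi-head attention -> Add & Norm -> Feed-forward network -> Add & Norm. Among the options, Multi-head attention -> Feed-forward network -> Layer normalization is the intended summary: attention precedes the feed-forward network, and layer normalization closes the layer.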
  3. Final Answer:

    Multi-head attention -> Feed-forward network -> Layer normalization -> Option A
  4. Quick Check:

    Encoder order = Attn -> FFN -> Norm [OK]
Hint: Encoder: attn -> feed-forward -> norm [OK]
Common Mistakes:
  • Mixing up the order of feed-forward and attention
  • Placing layer normalization incorrectly
  • Assuming normalization comes first
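
The sequence from Step 1 (attention, residual plus layer normalization, feed-forward, residual plus layer normalization) can be sketched as follows. This is a simplified illustration, not a full implementation: it assumes a single attention head, random placeholder weights, and a layer normalization without the learnable scale and shift.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v

def feed_forward(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2   # linear -> ReLU -> linear

def encoder_layer(x, params):
    x = layer_norm(x + self_attention(x, *params["attn"]))  # sub-layer 1: attention, add, norm
    x = layer_norm(x + feed_forward(x, *params["ffn"]))     # sub-layer 2: feed-forward, add, norm
    return x

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
params = {
    "attn": [rng.normal(size=(d_model, d_model)) for _ in range(3)],
    "ffn": [rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))],
}
x = rng.normal(size=(5, d_model))   # 5 tokens, 8 features each
out = encoder_layer(x, params)
print(out.shape)  # (5, 8)
```

Because the layer preserves the input shape, several such layers can be stacked one after another.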
3. Given a Transformer decoder layer with masked multi-head attention, what is the main reason for masking?
medium
A. To prevent the model from seeing future tokens during training
B. To speed up the training process
C. To increase the number of attention heads
D. To reduce the model size

Solution

  1. Step 1: Understand masking in decoder attention

    Masking hides future tokens so the model predicts the next word without cheating by looking ahead.
  2. Step 2: Evaluate options against masking purpose

    Only "To prevent the model from seeing future tokens during training" matches this purpose; masking does not speed up training, add attention heads, or shrink the model.
  3. Final Answer:

    To prevent the model from seeing future tokens during training -> Option A
  4. Quick Check:

    Masking = Hide future tokens [OK]
Hint: Masking hides future words in decoder [OK]
Common Mistakes:
  • Thinking masking speeds training
  • Confusing masking with model size reduction
  • Assuming masking adds attention heads
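
A causal mask can be sketched as a lower-triangular matrix applied to the attention scores before the softmax. The helper names below are hypothetical, and the raw scores are set to zero so that the effect of the mask alone is visible.

```python
import numpy as np

def causal_mask(seq_len):
    """True where attention is allowed: position i may see positions 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)             # blocked positions end up with zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                                # equal raw scores for every token pair
weights = masked_softmax(scores, causal_mask(4))
print(weights.round(2))
```

Each row i spreads its weight evenly over positions 0 through i and assigns exactly zero to every later position, so no token can draw information from the future.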
4. Consider this simplified Transformer encoder code snippet in Python:
import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim=8, num_heads=2)
    def forward(self, x):
        attn_output, _ = self.attention(x, x, x)
        return attn_output

x = torch.rand(5, 3, 8)  # sequence length=5, batch=3, embed=8
model = SimpleEncoder()
output = model(x)
print(output.shape)
Which statement about the input shape in this code is correct?
medium
A. MultiheadAttention requires input shape (sequence length, batch size, embedding), but x is (batch size, sequence length, embedding)
B. Input shape to MultiheadAttention should be (batch size, sequence length, embedding), but x is (5, 3, 8)
C. MultiheadAttention requires input shape (sequence length, batch size, embedding), but x is (5, 3, 8)
D. Input shape to MultiheadAttention should be (sequence length, batch size, embedding), but x is (5, 3, 8) which is correct

Solution

  1. Step 1: Check expected input shape for nn.MultiheadAttention

    By default (batch_first=False), PyTorch's MultiheadAttention expects input shape (sequence length, batch size, embedding dimension).
  2. Step 2: Verify input tensor shape

    The tensor x has shape (5, 3, 8) which matches (sequence length=5, batch size=3, embedding=8), so it is correct.
  3. Final Answer:

    Input shape to MultiheadAttention should be (sequence length, batch size, embedding), but x is (5, 3, 8) which is correct -> Option D
  4. Quick Check:

    Input shape = (seq_len, batch, embed) [OK]
Hint: MultiheadAttention input shape is (seq_len, batch, embed) [OK]
Common Mistakes:
  • Confusing batch and sequence length order
  • Assuming batch size is first dimension
  • Mixing embedding dimension position
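
If the data is laid out with the batch dimension first, the same module can be constructed with batch_first=True instead of transposing the tensor. The sketch below assumes a PyTorch version that supports this argument and reuses the sizes from the question.

```python
import torch
import torch.nn as nn

# batch_first=True switches the expected layout to (batch, seq_len, embed)
attention = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.rand(3, 5, 8)  # batch=3, sequence length=5, embed=8
attn_output, attn_weights = attention(x, x, x)

print(attn_output.shape)   # torch.Size([3, 5, 8])
print(attn_weights.shape)  # torch.Size([3, 5, 5]), averaged over heads by default
```

Passing a batch-first tensor to a module built with the default setting would not necessarily raise an error, because the two leading dimensions would simply be interpreted the other way around, so it is worth stating the layout explicitly.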
5. You want to build a Transformer model for translating short sentences. Which combination of components is essential in the Transformer architecture to handle this task effectively?
hard
A. Feed-forward networks only without attention
B. Only encoder layers with feed-forward networks
C. Encoder with self-attention, decoder with masked self-attention, and cross-attention layers
D. Decoder layers without attention mechanisms

Solution

  1. Step 1: Identify components needed for translation

    Translation requires understanding input (encoder) and generating output (decoder) with attention to both input and generated tokens.
  2. Step 2: Match components to translation needs

    Option C covers all three needs: encoder self-attention to understand the source sentence, masked self-attention in the decoder to prevent peeking at future tokens, and cross-attention to connect the decoder to the encoder's output.
  3. Final Answer:

    Encoder with self-attention, decoder with masked self-attention, and cross-attention layers -> Option C
  4. Quick Check:

    Translation needs encoder, decoder, and cross-attention [OK]
Hint: Translation needs encoder, decoder, and cross-attention [OK]
Common Mistakes:
  • Ignoring decoder or cross-attention layers
  • Using only feed-forward networks
  • Skipping masking in decoder
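
The three kinds of attention in the answer differ only in where the queries, keys, and values come from. The following sketch assumes single-head attention with no learned projections, residuals, or normalization, and uses random placeholder vectors for a 6-token source and a 4-token target.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)   # hide disallowed positions
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8
source = rng.normal(size=(6, d_model))   # 6 source-sentence tokens
target = rng.normal(size=(4, d_model))   # 4 target tokens generated so far

# Encoder: unmasked self-attention over the source sentence
memory, _ = attention(source, source, source)

# Decoder, step 1: masked self-attention over the target tokens
causal = np.tril(np.ones((4, 4), dtype=bool))
dec, self_w = attention(target, target, target, mask=causal)

# Decoder, step 2: cross-attention, queries from the decoder, keys and values from the encoder
out, cross_w = attention(dec, memory, memory)

print(self_w.shape, cross_w.shape, out.shape)
```

The cross-attention weights have one row per target token and one column per source token, which is how each generated word can draw on any part of the input sentence.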