Transformer architecture overview in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Transformer architecture overview
Which metrics matter for Transformer models, and why

Transformers are often used for tasks like language understanding and generation. The key metrics depend on the task:

  • For classification: Accuracy, Precision, Recall, and F1 score matter to measure how well the model predicts correct classes.
  • For sequence generation (like translation or text generation): BLEU and ROUGE measure n-gram overlap between the output and reference text, while perplexity measures how well the model predicts held-out text (lower is better).
  • For general model quality: Loss (like cross-entropy) on training and validation data shows how well the model learns patterns.

These metrics help us know if the Transformer understands and generates text well.

Confusion matrix example for Transformer classification
      Actual \ Predicted | Positive | Negative
      -------------------|----------|---------
      Positive           |    85    |   15
      Negative           |    10    |   90
    

This shows how many times the Transformer correctly or incorrectly predicted classes.

From this matrix:

  • True Positives (TP) = 85
  • False Positives (FP) = 10
  • True Negatives (TN) = 90
  • False Negatives (FN) = 15
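
The four counts above are enough to compute the classification metrics discussed in this section. The sketch below is a minimal illustration that treats Positive as the class of interest; the variable names are only for this example.

```python
# Counts taken from the confusion matrix above (Positive is the class of interest).
tp, fn = 85, 15
fp, tn = 10, 90

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.875 precision=0.895 recall=0.850 f1=0.872
```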
Precision vs Recall tradeoff with Transformer models

Imagine a Transformer used for spam detection:

  • Precision: How many emails marked as spam really are spam? High precision means fewer good emails wrongly marked as spam.
  • Recall: How many actual spam emails did the model catch? High recall means fewer spam emails slip through.

If the Transformer is tuned for high precision, it may miss some spam (lower recall). If tuned for high recall, it may mark good emails as spam (lower precision).

Choosing the right balance depends on what is worse: missing spam or wrongly blocking good emails.
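
One common way to tune this balance is to move the decision threshold applied to the model's spam score. The sketch below uses made-up scores and labels (they are not from any real model) to show precision falling and recall rising as the threshold is lowered.

```python
# Hypothetical spam scores (higher = more likely spam) and true labels (1 = spam).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

def precision_recall(threshold):
    """Mark an email as spam when its score is at or above the threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.85, 0.50, 0.25):
    p, r = precision_recall(t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
# threshold=0.85 precision=1.00 recall=0.40
# threshold=0.50 precision=0.80 recall=0.80
# threshold=0.25 precision=0.71 recall=1.00
```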

What good vs bad metric values look like for Transformer tasks
  • Good classification metrics: Accuracy > 90%, Precision and Recall both above 85%, F1 score close to 0.9.
  • Bad classification metrics: Accuracy below 70%, Precision or Recall below 50%, F1 score below 0.6.
  • Good generation metrics: Low perplexity (for example, around 10 or less), BLEU or ROUGE scores above 0.5 (50%).
  • Bad generation metrics: High perplexity (above 100), BLEU or ROUGE scores below 0.2 (20%).

These thresholds are rough rules of thumb. Acceptable values depend on the task, the dataset, the class balance, and (for perplexity) the tokenizer, so always compare against a baseline evaluated on the same data. Good metrics suggest the Transformer has learned the task and generalizes to new data. Bad metrics suggest it struggles to learn or generalize.
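
Perplexity and cross-entropy loss are two views of the same quantity: perplexity is the exponential of the average cross-entropy per token. The sketch below uses hypothetical token probabilities (not output from a real model) to make the relationship concrete.

```python
import math

# Hypothetical probabilities a model assigned to the correct next token at each position.
token_probs = [0.5, 0.25, 0.125, 0.5]

# Cross-entropy (in nats) is the mean negative log-probability of the correct tokens.
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is exp(cross-entropy): the model's effective number of equally likely choices.
perplexity = math.exp(cross_entropy)

print(f"cross_entropy={cross_entropy:.3f} perplexity={perplexity:.3f}")
```

A perplexity of 10 therefore corresponds to a cross-entropy of about 2.3 nats per token, which is why the two numbers should be read together.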

Common pitfalls in Transformer model metrics
  • Accuracy paradox: High accuracy can be misleading if classes are imbalanced (e.g., 95% accuracy but model ignores rare class).
  • Data leakage: If test data leaks into training, metrics look unrealistically good but model fails in real use.
  • Overfitting: Very low training loss but high test loss means model memorized training data but can't generalize.
  • Ignoring task-specific metrics: Using accuracy alone for generation tasks misses quality aspects like fluency or relevance.
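
The accuracy paradox from the first pitfall is easy to reproduce. The sketch below assumes a made-up test set with 95 negatives and 5 positives and a degenerate "model" that always predicts the majority class.

```python
# Hypothetical imbalanced test set: 95 negatives (0) and 5 positives (1).
labels = [0] * 95 + [1] * 5

# A degenerate "model" that always predicts the majority class.
preds = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# accuracy=0.95 recall=0.00
```

Accuracy looks strong even though the rare class is never detected, which is why recall (or F1) should be reported alongside it.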
Self-check question

Your Transformer model has 98% accuracy but only 12% recall on detecting fraud cases. Is it good for production? Why or why not?

Answer: No, it is not good. Although accuracy is high, the model misses 88% of fraud cases (low recall). For fraud detection, catching fraud (high recall) is critical to avoid losses. This model would let most fraud slip through.

Key Result
For Transformers, task-specific metrics like precision, recall, and loss reveal true model quality beyond simple accuracy.

Practice

1. What is the main purpose of the attention mechanism in a Transformer model?
easy
A. To increase the size of the model
B. To focus on important parts of the input data
C. To reduce the number of layers
D. To store data permanently

Solution

  1. Step 1: Understand attention mechanism role

    The attention mechanism helps the model decide which parts of the input are important to focus on for better understanding.
  2. Step 2: Compare options with attention purpose

    Only option B, "To focus on important parts of the input data", describes this role; the other options describe unrelated functions.
  3. Final Answer:

    To focus on important parts of the input data -> Option B
  4. Quick Check:

    Attention = Focus on important parts [OK]
Hint: Attention means focusing on key input parts [OK]
Common Mistakes:
  • Thinking attention increases model size
  • Confusing attention with data storage
  • Assuming attention reduces layers
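
To make "focusing" concrete, here is a minimal pure-Python sketch of scaled dot-product attention, the core operation inside a Transformer attention head. The token vectors are hypothetical, and learned query/key/value projections and multiple heads are left out for brevity.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical 2-dimensional token vectors, used as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how strongly one token attends to every token in the input; larger weights mark the parts the model focuses on.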
2. Which of the following is the correct order of components inside a Transformer encoder layer?
easy
A. Multi-head attention -> Feed-forward network -> Layer normalization
B. Feed-forward network -> Multi-head attention -> Layer normalization
C. Multi-head attention -> Layer normalization -> Feed-forward network
D. Layer normalization -> Multi-head attention -> Feed-forward network

Solution

  1. Step 1: Recall Transformer encoder layer structure

    In the original (post-LN) Transformer, the encoder layer first computes multi-head attention, adds the residual, and applies layer normalization. It then computes the feed-forward network, followed by another residual connection and layer normalization.
  2. Step 2: Match the correct sequence

    The order is therefore Multi-head attention -> Layer normalization -> Feed-forward network, with a second layer normalization after the feed-forward sub-layer. Option A skips the normalization between the two sub-layers. Pre-LN variants, which normalize before each sub-layer, are closer to option D, but they are not the original design.
  3. Final Answer:

    Multi-head attention -> Layer normalization -> Feed-forward network -> Option C
  4. Quick Check:

    Encoder order = Attn -> Norm -> FFN -> Norm [OK]
Hint: Encoder: attn -> norm -> feed-forward -> norm [OK]
Common Mistakes:
  • Mixing up the order of feed-forward and attention
  • Placing layer normalization incorrectly
  • Assuming normalization comes first (true only for pre-LN variants)
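
The sub-layer order in the original (post-LN) encoder design can be sketched in PyTorch as follows. This is a simplified illustration, not a reference implementation: the class name PostLNEncoderLayer and the sizes are assumptions for this example, and dropout is omitted.

```python
import torch
import torch.nn as nn

class PostLNEncoderLayer(nn.Module):
    """Hypothetical minimal encoder layer: attn -> add & norm -> FFN -> add & norm."""

    def __init__(self, embed_dim=8, num_heads=2, ff_dim=16):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, embed_dim)
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Sub-layer 1: self-attention, residual connection, then layer normalization.
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_output)
        # Sub-layer 2: feed-forward network, residual connection, then layer normalization.
        x = self.norm2(x + self.feed_forward(x))
        return x

x = torch.rand(5, 3, 8)  # sequence length=5, batch=3, embed=8
layer = PostLNEncoderLayer()
output = layer(x)
print(output.shape)  # torch.Size([5, 3, 8])
```

A pre-LN variant would instead apply each normalization to the sub-layer input, before attention and before the feed-forward network.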
3. Given a Transformer decoder layer with masked multi-head attention, what is the main reason for masking?
medium
A. To prevent the model from seeing future tokens during training
B. To speed up the training process
C. To increase the number of attention heads
D. To reduce the model size

Solution

  1. Step 1: Understand masking in decoder attention

    Masking hides future tokens so the model predicts the next word without cheating by looking ahead.
  2. Step 2: Evaluate options against masking purpose

    Only option A, "To prevent the model from seeing future tokens during training", explains the role of masking; the other options describe effects that masking does not have.
  3. Final Answer:

    To prevent the model from seeing future tokens during training -> Option A
  4. Quick Check:

    Masking = Hide future tokens [OK]
Hint: Masking hides future words in decoder [OK]
Common Mistakes:
  • Thinking masking speeds training
  • Confusing masking with model size reduction
  • Assuming masking adds attention heads
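
A causal mask can be sketched in pure Python by setting the scores of future positions to negative infinity before the softmax, so their attention weights become exactly zero. The all-zero score matrix below is a hypothetical placeholder for real query-key scores.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i; later positions get -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical scores: every pair of positions is equally similar.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

The zeros above the diagonal show that no position receives information from tokens that come after it.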
4. Consider this simplified Transformer encoder code snippet in Python:
import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim=8, num_heads=2)
    def forward(self, x):
        attn_output, _ = self.attention(x, x, x)
        return attn_output

x = torch.rand(5, 3, 8)  # sequence length=5, batch=3, embed=8
model = SimpleEncoder()
output = model(x)
print(output.shape)
Does this code have an input-shape error?
medium
A. Yes: MultiheadAttention requires input shape (sequence length, batch size, embedding), but x is (batch size, sequence length, embedding)
B. Yes: the input shape to MultiheadAttention should be (batch size, sequence length, embedding), but x is (5, 3, 8)
C. Yes: MultiheadAttention requires input shape (sequence length, batch size, embedding), and x with shape (5, 3, 8) does not match it
D. No: the input shape to MultiheadAttention should be (sequence length, batch size, embedding), and x is (5, 3, 8), which is correct

Solution

  1. Step 1: Check expected input shape for nn.MultiheadAttention

    With the default batch_first=False, PyTorch's MultiheadAttention expects input shape (sequence length, batch size, embedding dimension).
  2. Step 2: Verify input tensor shape

    The tensor x has shape (5, 3, 8), which matches (sequence length=5, batch size=3, embedding=8), so there is no error. The code runs and prints torch.Size([5, 3, 8]).
  3. Final Answer:

    No: the input shape to MultiheadAttention should be (sequence length, batch size, embedding), and x is (5, 3, 8), which is correct -> Option D
  4. Quick Check:

    Input shape = (seq_len, batch, embed) [OK]
Hint: MultiheadAttention input shape is (seq_len, batch, embed) [OK]
Common Mistakes:
  • Confusing batch and sequence length order
  • Assuming batch size is first dimension
  • Mixing embedding dimension position
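
If your data is laid out batch-first, nn.MultiheadAttention accepts a batch_first=True argument (available in recent PyTorch releases) instead of requiring a transpose. The sketch below reuses the sizes from the question with the first two dimensions swapped.

```python
import torch
import torch.nn as nn

# batch_first=True makes the layer accept (batch, sequence length, embedding).
attention = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.rand(3, 5, 8)  # batch=3, sequence length=5, embed=8
attn_output, attn_weights = attention(x, x, x)

print(attn_output.shape)   # torch.Size([3, 5, 8])
print(attn_weights.shape)  # torch.Size([3, 5, 5]), averaged over heads by default
```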
5. You want to build a Transformer model for translating short sentences. Which combination of components is essential in the Transformer architecture to handle this task effectively?
hard
A. Feed-forward networks only without attention
B. Only encoder layers with feed-forward networks
C. Encoder with self-attention, decoder with masked self-attention, and cross-attention layers
D. Decoder layers without attention mechanisms

Solution

  1. Step 1: Identify components needed for translation

    Translation requires understanding input (encoder) and generating output (decoder) with attention to both input and generated tokens.
  2. Step 2: Match components to translation needs

    Option C covers all three pieces: encoder self-attention to represent the source sentence, masked self-attention in the decoder to prevent peeking at future tokens, and cross-attention so the decoder can attend to the encoder output. The other options drop attention or the decoder entirely.
  3. Final Answer:

    Encoder with self-attention, decoder with masked self-attention, and cross-attention layers -> Option C
  4. Quick Check:

    Translation needs encoder, decoder, and cross-attention [OK]
Hint: Translation needs encoder, decoder, and cross-attention [OK]
Common Mistakes:
  • Ignoring decoder or cross-attention layers
  • Using only feed-forward networks
  • Skipping masking in decoder
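
PyTorch's nn.Transformer bundles these three pieces: encoder self-attention, masked decoder self-attention, and cross-attention. The sketch below is a shape-level illustration only. The sizes are arbitrary assumptions, the random tensors stand in for embedded source and target tokens, and token embeddings, positional encodings, and the output projection to a vocabulary are omitted.

```python
import torch
import torch.nn as nn

# Small encoder-decoder Transformer; all sizes are arbitrary choices for this example.
model = nn.Transformer(
    d_model=16, nhead=2,
    num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=32, batch_first=True,
)

src = torch.rand(3, 7, 16)  # (batch, source length, d_model): stand-in for embedded source tokens
tgt = torch.rand(3, 5, 16)  # (batch, target length, d_model): stand-in for embedded target tokens

# Causal mask for decoder self-attention: -inf above the diagonal hides future tokens.
tgt_len = tgt.size(1)
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

output = model(src, tgt, tgt_mask=tgt_mask)
print(output.shape)  # torch.Size([3, 5, 16])
```

The output has one vector per target position; a real translation model would project each vector onto the target vocabulary to predict the next token.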