
Transformer architecture in NLP - Model Pipeline Trace

Model Pipeline - Transformer architecture

The Transformer architecture processes text by converting words into numerical vectors, then learning relationships between those words with attention. During training it learns to predict the next word (or to classify text), and its accuracy improves as the loss falls.

Data Flow - 7 Stages
1. Input Text
   Transform: raw text input
   Shape: 1 sentence x variable length -> 1 sentence x variable length
   Example: "The cat sat on the mat."
2. Tokenization
   Transform: split the sentence into tokens (words or subwords)
   Shape: 1 sentence x variable length -> 1 sentence x 6 tokens
   Example: ["The", "cat", "sat", "on", "the", "mat"]
3. Embedding
   Transform: convert each token to a vector of size 512
   Shape: 1 sentence x 6 tokens -> 1 sentence x 6 tokens x 512 features
   Example: [[0.1, 0.3, ..., 0.2], ..., [0.05, 0.4, ..., 0.1]]
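The embedding lookup can be sketched with NumPy; the tiny vocabulary and random matrix below are toy stand-ins for a learned embedding table:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
d_model = 512

# Learned during training in practice; random here for illustration.
embedding = rng.normal(size=(len(vocab), d_model))

tokens = ["The", "cat", "sat", "on", "the", "mat"]
token_ids = [vocab[t.lower()] for t in tokens]
x = embedding[token_ids]  # look up one 512-dim vector per token
print(x.shape)  # (6, 512)
```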
4. Positional Encoding
   Transform: add position information to the embeddings
   Shape: 1 sentence x 6 tokens x 512 features -> 1 sentence x 6 tokens x 512 features
   Example: embedding vectors with added position signals
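The page does not say which encoding is used; assuming the sinusoidal scheme from the original Transformer, a sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal scheme: even dimensions use sin, odd use cos,
    # at geometrically increasing wavelengths.
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(6, 512)
print(pe.shape)  # (6, 512)
# The model adds `pe` element-wise to the (6, 512) embeddings.
```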
5. Multi-Head Self-Attention
   Transform: calculate attention scores and weighted sums
   Shape: 1 sentence x 6 tokens x 512 features -> 1 sentence x 6 tokens x 512 features
   Example: attention output vectors encoding word relationships
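A single-head sketch of this stage (multi-head attention runs several such heads in parallel and concatenates their outputs); the projection weights here are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Scaled dot-product attention for one head.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores)   # (6, 6): each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 512
x = rng.normal(size=(6, d))     # 6 tokens x 512 features
wq, wk, wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
out, weights = self_attention(x, wq, wk, wv)
print(out.shape)  # (6, 512)
```

Each row of `weights` says how much that token attends to every other token in the sentence.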
6. Feed-Forward Network
   Transform: apply two linear layers with a ReLU activation
   Shape: 1 sentence x 6 tokens x 512 features -> 1 sentence x 6 tokens x 512 features
   Example: processed feature vectors for each token
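The two-layer network described above, sketched with random weights; the inner width of 2048 is an assumption (the common 4 x d_model convention), not stated by the page:

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise FFN: Linear -> ReLU -> Linear, applied
    # independently to each token's 512-dim vector.
    h = np.maximum(0.0, x @ w1 + b1)  # ReLU
    return h @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048  # d_ff is an assumed inner width
x = rng.normal(size=(6, d_model))
w1 = rng.normal(size=(d_model, d_ff)) * d_model**-0.5
w2 = rng.normal(size=(d_ff, d_model)) * d_ff**-0.5
out = feed_forward(x, w1, np.zeros(d_ff), w2, np.zeros(d_model))
print(out.shape)  # (6, 512)
```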
7. Output Layer
   Transform: project to vocabulary size for prediction
   Shape: 1 sentence x 6 tokens x 512 features -> 1 sentence x 6 tokens x 10000 classes
   Example: probabilities for each word in the vocabulary
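The final projection and softmax can be sketched as follows, with a random weight matrix standing in for the learned output projection:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10000
x = rng.normal(size=(6, d_model))   # final hidden states per token
w_out = rng.normal(size=(d_model, vocab_size)) * d_model**-0.5
probs = softmax(x @ w_out)          # (6 tokens, 10000 classes)
print(probs.shape)                  # (6, 10000)
# Each row is a probability distribution over the vocabulary.
```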
Training Trace - Epoch by Epoch

Loss
5.2 |**************
4.0 |**********
2.8 |*******
1.6 |****
0.4 |*
    +----------------
     1  3  5  7  10 Epochs
Epoch | Loss ↓ | Accuracy ↑ | Observation
------|--------|------------|--------------------------------------------------
1     | 5.2    | 0.12       | Model starts with high loss and low accuracy
2     | 3.8    | 0.28       | Loss decreases, accuracy improves as the model learns
3     | 2.7    | 0.45       | Model captures basic word relationships
4     | 1.9    | 0.60       | Attention mechanism helps improve predictions
5     | 1.3    | 0.72       | Model learns complex context and syntax
6     | 0.9    | 0.81       | Loss steadily decreases, accuracy rises
7     | 0.7    | 0.86       | Model converges with good performance
8     | 0.6    | 0.89       | Fine-tuning improves accuracy further
9     | 0.55   | 0.91       | Model predictions become more confident
10    | 0.50   | 0.93       | Training converges with low loss and high accuracy
Prediction Trace - 6 Layers
Layer 1: Tokenization
Layer 2: Embedding
Layer 3: Positional Encoding
Layer 4: Multi-Head Self-Attention
Layer 5: Feed-Forward Network
Layer 6: Output Layer
Model Quiz - 3 Questions
Test your understanding
What does the positional encoding add to the token embeddings?
A. Random noise to improve generalization
B. Information about the order of words
C. Labels for each token's part of speech
D. The final prediction probabilities
Key Insight
The Transformer architecture uses attention to understand relationships between words regardless of their position. This allows it to learn complex language patterns efficiently, shown by decreasing loss and increasing accuracy during training.