
Transformer architecture in NLP

Introduction
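Transformers help computers understand and generate language by looking at all the words in a sentence at once rather than one at a time. This makes training easy to parallelize and helps the model capture context across the whole sentence. Common uses include: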
Translating sentences from one language to another quickly and accurately.
Summarizing long articles into short, clear points.
Answering questions based on a given text.
Generating text like writing stories or emails automatically.
Understanding the meaning of words in different contexts.
Syntax
NLP
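import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder  # any module that maps src to a representation
        self.decoder = decoder  # any module that takes (tgt, enc_output)

    def forward(self, src, tgt):
        # Encode the source once, then let the decoder attend to that encoding.
        enc_output = self.encoder(src)
        output = self.decoder(tgt, enc_output)
        return output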
The original Transformer has two main parts: an encoder and a decoder.
It uses 'self-attention' to weigh how relevant every other word in the sentence is to the word currently being processed.
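To make the self-attention idea concrete, here is a minimal single-head sketch of scaled dot-product attention. The function name `self_attention`, the random projection matrices, and the sizes are assumptions chosen for illustration; real Transformer layers use several heads with learned projections.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative helper).

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices.
    """
    q = x @ w_q  # queries: what each word is looking for
    k = x @ w_k  # keys: what each word offers
    v = x @ w_v  # values: the content that gets mixed together
    scores = q @ k.T / math.sqrt(k.size(-1))  # (seq_len, seq_len) relevance scores
    weights = torch.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
seq_len, d_model, d_k = 4, 8, 8  # hypothetical sizes
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)      # torch.Size([4, 8])
print(weights.shape)  # torch.Size([4, 4])
```

Row i of `weights` says how much word i draws on every word in the sentence, including itself.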
Examples
The encoder processes the input sentence and creates a representation.
NLP
encoder_output = encoder(src_sequence)
The decoder uses the encoder's output and the target sequence to predict the next words.
NLP
decoder_output = decoder(tgt_sequence, encoder_output)
The full Transformer model takes input and target sequences to produce predictions.
NLP
output = transformer(src_sequence, tgt_sequence)
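The one-line calls above can be tried with PyTorch's built-in layers. This is a sketch under a few assumptions: the sizes are arbitrary, the inputs are random vectors standing in for already-embedded tokens (not token ids), and the target mask that a real decoder would use during training is left out for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead = 16, 4  # hypothetical sizes

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=32),
    num_layers=1,
)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=32),
    num_layers=1,
)

# Already-embedded sequences in PyTorch's default (seq_len, batch, d_model) layout.
src_sequence = torch.randn(7, 2, d_model)
tgt_sequence = torch.randn(5, 2, d_model)

encoder_output = encoder(src_sequence)                  # one vector per source position
decoder_output = decoder(tgt_sequence, encoder_output)  # one vector per target position

print(encoder_output.shape)  # torch.Size([7, 2, 16])
print(decoder_output.shape)  # torch.Size([5, 2, 16])
```

The source and target lengths differ (7 and 5), but the batch size and embedding size match, which is what the decoder's attention over the encoder output requires.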
Sample Model
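This code builds a small Transformer encoder that takes a sequence of token ids (numbers representing words) and outputs a score for every vocabulary entry at each position. The model is untrained and uses no causal mask, so the printed probabilities are essentially arbitrary; the point is the data flow and the tensor shapes. It prints the shape of the output and the probabilities for the first position in the sequence.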
NLP
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, hidden_dim, num_layers):
        super().__init__()
        self.d_model = embed_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Positional encoding
        max_len = 5000
        pe = torch.zeros(max_len, self.d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, self.d_model, 2, dtype=torch.float) * (-math.log(10000.0) / self.d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pos_encoder', pe.unsqueeze(1))
        encoder_layer = nn.TransformerEncoderLayer(d_model=self.d_model, nhead=num_heads, dim_feedforward=hidden_dim)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc_out = nn.Linear(self.d_model, vocab_size)

    def forward(self, src):
        embedded = self.embedding(src) * math.sqrt(self.d_model)
        embedded = embedded + self.pos_encoder[:embedded.size(0)]  # (seq_len, batch, embed_size)
        encoded = self.encoder(embedded)  # (seq_len, batch, embed_size)
        output = self.fc_out(encoded)  # (seq_len, batch, vocab_size)
        return output

# Sample data: batch size 1, sequence length 5
vocab_size = 10
embed_size = 8
num_heads = 2
hidden_dim = 16
num_layers = 1

model = SimpleTransformer(vocab_size, embed_size, num_heads, hidden_dim, num_layers)

# Input sequence of token ids (seq_len=5, batch=1)
src = torch.tensor([[1, 2, 3, 4, 5]]).T  # shape (5,1)

output = model(src)  # shape (5,1,vocab_size)

# Convert output logits to probabilities
probs = F.softmax(output, dim=-1)

# Print shape and first token probabilities
print(f"Output shape: {output.shape}")
print(f"Probabilities for first token:\n{probs[0,0].detach().numpy()}")
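The sample model applies no attention mask, so every position can see the tokens that come after it. For genuine next-word prediction, a causal mask is usually added so each position attends only to itself and earlier positions. The sketch below assumes a tiny stand-alone encoder stack (not the SimpleTransformer class above) with arbitrary sizes, and checks that changing the last token leaves the earlier outputs untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model = 5, 8  # hypothetical sizes

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=2, dim_feedforward=16)
encoder = nn.TransformerEncoder(layer, num_layers=1)
encoder.eval()  # disable dropout so the two runs below are comparable

# True marks positions a token may NOT attend to (everything to its right).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

x = torch.randn(seq_len, 1, d_model)     # (seq_len, batch, d_model)
x_changed = x.clone()
x_changed[-1] = torch.randn(1, d_model)  # alter only the last position

with torch.no_grad():
    out = encoder(x, mask=causal_mask)
    out_changed = encoder(x_changed, mask=causal_mask)

# With the mask in place, earlier positions cannot see the altered last token.
earlier_unchanged = torch.allclose(out[:-1], out_changed[:-1], atol=1e-6)
print(earlier_unchanged)
```

Without the `mask` argument the same comparison would generally fail, because every position would then depend on the last token.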
Important Notes
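Transformers do not process words one by one: within a layer, all positions are handled in parallel, which speeds up training and lets every word draw on context from the whole sentence. When generating text, a decoder still produces its output one token at a time.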
Self-attention lets the model decide which words to focus on for each word it processes.
Positional information is added because Transformers do not know word order by default.
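The point about word order can be checked directly. The sketch below, which assumes arbitrary sizes and a random input, feeds a sequence and a shuffled copy through one attention layer with no positional encoding. The outputs come back shuffled in exactly the same way, so the layer by itself cannot tell which word came first.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attention = nn.MultiheadAttention(embed_dim=8, num_heads=2)
attention.eval()

x = torch.randn(5, 1, 8)              # (seq_len, batch, embed)
perm = torch.tensor([4, 2, 0, 3, 1])  # an arbitrary reordering of the 5 positions
x_shuffled = x[perm]

with torch.no_grad():
    out, _ = attention(x, x, x)
    out_shuffled, _ = attention(x_shuffled, x_shuffled, x_shuffled)

# Shuffling the input only shuffles the output the same way.
order_blind = torch.allclose(out[perm], out_shuffled, atol=1e-5)
print(order_blind)
```

Adding a position-dependent vector to each embedding before attention breaks this symmetry, which is what the positional encoding in the sample model does.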
Summary
Transformers use self-attention to understand all words in a sentence together.
The original design pairs an encoder, which processes the input, with a decoder, which generates the output.
They are very good for tasks like translation, summarization, and text generation.

Practice

1. What is the main purpose of the self-attention mechanism in a Transformer model?
easy
A. To increase the number of layers in the model
B. To reduce the size of the input data
C. To convert words into numbers
D. To let the model focus on different words in the sentence at the same time

Solution

  1. Step 1: Understand self-attention role

    Self-attention helps the model look at all words together and decide which words are important for each word.
  2. Step 2: Match purpose with options

    Option D correctly describes this: the model attends to different words in the sentence at the same time. The other options describe unrelated tasks.
  3. Final Answer:

    To let the model focus on different words in the sentence at the same time -> Option D
  4. Quick Check:

    Self-attention = focus on words together [OK]
Hint: Self-attention means focusing on all words at once [OK]
Common Mistakes:
  • Thinking self-attention reduces input size
  • Confusing self-attention with embedding
  • Assuming it increases model layers
2. Which of the following is the correct way to describe the Transformer architecture components?
easy
A. It has encoder and decoder parts
B. It has only an encoder part
C. It uses only convolutional layers
D. It uses recurrent neural networks

Solution

  1. Step 1: Recall Transformer structure

    Transformers have two main parts: encoder to process input and decoder to generate output.
  2. Step 2: Compare options with structure

    Option A correctly states that both an encoder and a decoder are present; the other options name incorrect or unrelated components.
  3. Final Answer:

    It has encoder and decoder parts -> Option A
  4. Quick Check:

    Transformer = encoder + decoder [OK]
Hint: Remember: Transformer = encoder + decoder [OK]
Common Mistakes:
  • Thinking Transformer has only encoder
  • Confusing Transformer with CNN or RNN
  • Ignoring decoder role
3. Consider this simplified Transformer encoder code snippet in Python using PyTorch:
import torch
from torch import nn

class SimpleEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim=4, num_heads=2)
    def forward(self, x):
        attn_output, _ = self.attention(x, x, x)
        return attn_output

x = torch.rand(5, 3, 4)  # sequence length=5, batch=3, embed=4
model = SimpleEncoder()
output = model(x)
print(output.shape)
What will be the printed output shape?
medium
A. torch.Size([3, 5, 4])
B. torch.Size([5, 3, 4])
C. torch.Size([5, 4, 3])
D. torch.Size([3, 4, 5])

Solution

  1. Step 1: Understand input shape and MultiheadAttention

    Input shape is (sequence length=5, batch=3, embedding=4). By default (batch_first=False), PyTorch MultiheadAttention expects (seq_len, batch, embed).
  2. Step 2: Output shape matches input shape

    MultiheadAttention returns an output with the same shape as the query. Here the query, key, and value are all x, so the output is (5, 3, 4).
  3. Final Answer:

    torch.Size([5, 3, 4]) -> Option B
  4. Quick Check:

    Output shape = query shape for MultiheadAttention [OK]
Hint: MultiheadAttention output has the same shape as the query [OK]
Common Mistakes:
  • Mixing batch and sequence dimensions
  • Assuming output shape changes embedding size
  • Confusing PyTorch input format
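Since mixing up the batch and sequence dimensions is a common slip, it may help to see the alternative layout. As a small sketch, the same layer can be built with `batch_first=True`, in which case it expects (batch, seq_len, embed) instead. The tensor sizes mirror the question; the variable names are illustrative.

```python
import torch
from torch import nn

attention = nn.MultiheadAttention(embed_dim=4, num_heads=2, batch_first=True)

x = torch.rand(3, 5, 4)  # batch=3, sequence length=5, embed=4
attn_output, attn_weights = attention(x, x, x)

print(attn_output.shape)   # torch.Size([3, 5, 4])
print(attn_weights.shape)  # torch.Size([3, 5, 5])
```

The second return value holds the attention weights averaged over heads, one (seq_len x seq_len) matrix per batch item.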
4. You have this Transformer decoder code snippet that throws an error:
import torch
from torch import nn

class SimpleDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim=8, num_heads=4)
    def forward(self, tgt, memory):
        attn_output, _ = self.attention(tgt, memory, memory)
        return attn_output

tgt = torch.rand(10, 2, 8)  # target seq len=10, batch=2, embed=8
memory = torch.rand(5, 3, 8)  # memory seq len=5, batch=3, embed=8
model = SimpleDecoder()
output = model(tgt, memory)
print(output.shape)
What is the likely cause of the error?
medium
A. Sequence length mismatch between tgt and memory
B. Mismatch in embedding dimensions between tgt and memory
C. Batch size mismatch between tgt and memory
D. Number of attention heads is too high

Solution

  1. Step 1: Check shapes of tgt and memory

    tgt=(10,2,8), memory=(5,3,8). Both have embedding size 8, sequence lengths differ (10 vs 5, allowed), but batch sizes differ (2 vs 3).
  2. Step 2: Identify batch size mismatch

    Batch size mismatch between tgt (batch=2) and memory (batch=3) causes the RuntimeError in MultiheadAttention.
  3. Step 3: Re-examine options carefully

    Embedding sizes match, a sequence length mismatch is allowed in this kind of attention, and 4 heads divide the embedding size of 8 evenly. The batch size mismatch is the only shape conflict.
  4. Final Answer:

    Batch size mismatch between tgt and memory -> Option C
  5. Quick Check:

    Batch sizes must match for attention [OK]
Hint: Check batch sizes first when attention errors occur [OK]
Common Mistakes:
  • Assuming sequence length must match
  • Blaming embedding size mismatch incorrectly
  • Thinking number of heads causes shape errors
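For comparison, here is a sketch of the same snippet with the batch sizes aligned. The only change assumed is setting the batch dimension of `memory` to 2 so it matches `tgt`; the sequence lengths are still different, which is fine.

```python
import torch
from torch import nn

class SimpleDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim=8, num_heads=4)

    def forward(self, tgt, memory):
        # Queries come from tgt; keys and values come from memory.
        attn_output, _ = self.attention(tgt, memory, memory)
        return attn_output

tgt = torch.rand(10, 2, 8)    # target seq len=10, batch=2, embed=8
memory = torch.rand(5, 2, 8)  # memory seq len=5, batch=2 (now matches tgt), embed=8
model = SimpleDecoder()
output = model(tgt, memory)
print(output.shape)  # torch.Size([10, 2, 8])
```

The output follows the shape of the query (`tgt`), not of `memory`.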
5. You want to build a Transformer model for text summarization. Which combination of components is best suited for this task?
hard
A. Encoder-decoder, because summarization needs understanding input and generating output
B. Decoder only, because summarization is text generation
C. Neither encoder nor decoder, use RNN instead
D. Encoder only, because summarization needs understanding input only

Solution

  1. Step 1: Understand summarization task

    Summarization requires reading input text (encoding) and producing a shorter text (decoding).
  2. Step 2: Match task with Transformer parts

    Encoder-decoder architecture fits best as encoder understands input and decoder generates summary output.
  3. Final Answer:

    Encoder-decoder, because summarization needs understanding input and generating output -> Option A
  4. Quick Check:

    Summarization = encoder + decoder [OK]
Hint: Summarization needs both understanding and generating text [OK]
Common Mistakes:
  • Choosing encoder only for generation tasks
  • Choosing decoder only ignoring input understanding
  • Ignoring Transformer benefits and choosing RNN