What is the Transformer architecture in NLP?


Introduction

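Transformers help computers understand and generate language by looking at all words in a sentence at once instead of one at a time. This lets training run in parallel and lets each word draw on context from anywhere in the sentence. Common uses include: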

Translating sentences from one language to another quickly and accurately.

Summarizing long articles into short, clear points.

Answering questions based on a given text.

Generating text like writing stories or emails automatically.

Understanding the meaning of words in different contexts.

Syntax


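import torch.nn as nn

# Skeleton only: Encoder, Decoder, and the elided arguments (...) are placeholders.
class Transformer(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.encoder = Encoder(...)  # reads the source sequence
        self.decoder = Decoder(...)  # generates the target sequence

    def forward(self, src, tgt):
        enc_output = self.encoder(src)          # encode the source sequence
        output = self.decoder(tgt, enc_output)  # decode using the encoder output
        return output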

The Transformer has two main parts: encoder and decoder.

It uses 'self-attention': for each word, the model computes how much weight to give every other word in the sentence.
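To make the self-attention idea concrete, here is a minimal single-head sketch of scaled dot-product attention. The function name `self_attention`, the random projection matrices, and the sizes are assumptions chosen for illustration; real models learn the projections and use several heads.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices.
    """
    q = x @ w_q                                 # queries
    k = x @ w_k                                 # keys
    v = x @ w_v                                 # values
    scores = q @ k.T / math.sqrt(k.size(-1))    # (seq_len, seq_len) similarity scores
    weights = torch.softmax(scores, dim=-1)     # each row sums to 1
    return weights @ v, weights                 # weighted mix of values, plus the weights

torch.manual_seed(0)
seq_len, d_model, d_k = 4, 8, 8                 # hypothetical sizes
x = torch.randn(seq_len, d_model)               # stand-in for 4 embedded words
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)       # one output vector per word
print(weights.shape)   # one row of attention weights per word
```

Row i of `weights` says how strongly word i attends to each word in the sentence, including itself.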

Examples

The encoder processes the input sentence and creates a representation.


encoder_output = encoder(src_sequence)

The decoder uses the encoder's output and the target sequence to predict the next words.


decoder_output = decoder(tgt_sequence, encoder_output)

The full Transformer model takes input and target sequences to produce predictions.


output = transformer(src_sequence, tgt_sequence)
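The three calls above can be tried with PyTorch's built-in `nn.Transformer`. This is a sketch under assumptions: the sizes are arbitrary, and random tensors stand in for already-embedded tokens because `nn.Transformer` expects vectors rather than token ids. The causal mask is added here so that each target position can only look at earlier positions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16  # hypothetical embedding size

transformer = nn.Transformer(
    d_model=d_model, nhead=2,
    num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=32,
)
transformer.eval()  # turn off dropout for a repeatable forward pass

# Default layout is (sequence length, batch size, d_model)
src_sequence = torch.randn(6, 1, d_model)  # 6 source positions
tgt_sequence = torch.randn(4, 1, d_model)  # 4 target positions

# Causal mask: -inf above the diagonal blocks attention to later positions
tgt_len = tgt_sequence.size(0)
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

with torch.no_grad():
    encoder_output = transformer.encoder(src_sequence)
    decoder_output = transformer.decoder(tgt_sequence, encoder_output, tgt_mask=tgt_mask)
    output = transformer(src_sequence, tgt_sequence, tgt_mask=tgt_mask)

print(encoder_output.shape)  # same shape as the source
print(output.shape)          # same shape as the target
```

Calling the full model is equivalent to running the encoder and then the decoder, so `output` and `decoder_output` hold the same values here.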

Sample Model

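This code builds a simple Transformer encoder model that takes a sequence of token ids (numbers representing words) and outputs, for every position, a score for each word in the vocabulary. It prints the shape of the output and the probabilities at the first position in the sequence. The model is untrained, so the probabilities are arbitrary and change from run to run.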


import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, hidden_dim, num_layers):
        super().__init__()
        self.d_model = embed_size
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Positional encoding
        max_len = 5000
        pe = torch.zeros(max_len, self.d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, self.d_model, 2, dtype=torch.float) * (-math.log(10000.0) / self.d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pos_encoder', pe.unsqueeze(1))
        encoder_layer = nn.TransformerEncoderLayer(d_model=self.d_model, nhead=num_heads, dim_feedforward=hidden_dim)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc_out = nn.Linear(self.d_model, vocab_size)

    def forward(self, src):
        embedded = self.embedding(src) * math.sqrt(self.d_model)
        embedded = embedded + self.pos_encoder[:embedded.size(0)]  # (seq_len, batch, embed_size)
        encoded = self.encoder(embedded)  # (seq_len, batch, embed_size)
        output = self.fc_out(encoded)  # (seq_len, batch, vocab_size)
        return output

# Sample data: batch size 1, sequence length 5
vocab_size = 10
embed_size = 8
num_heads = 2
hidden_dim = 16
num_layers = 1

model = SimpleTransformer(vocab_size, embed_size, num_heads, hidden_dim, num_layers)
model.eval()  # disable dropout, since this is inference rather than training

# Input sequence of token ids (seq_len=5, batch=1)
src = torch.tensor([[1, 2, 3, 4, 5]]).T  # shape (5,1)

output = model(src)  # shape (5,1,vocab_size)

# Convert output logits to probabilities
probs = F.softmax(output, dim=-1)

# Print shape and first token probabilities
print(f"Output shape: {output.shape}")
print(f"Probabilities for first token:\n{probs[0,0].detach().numpy()}")


Important Notes

Transformers do not process words one by one but all at once, so each word can use context from every other word in the sentence directly.

Self-attention lets the model decide which words to focus on for each word it processes.

Positional information is added because self-attention by itself has no notion of word order.
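The point about word order can be checked with a small experiment. In this sketch, which assumes arbitrary sizes and an untrained `nn.MultiheadAttention` layer, the input positions are shuffled and the outputs turn out to be the same vectors in shuffled order, so the layer alone cannot tell one ordering from another.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2)
attn.eval()

x = torch.randn(5, 1, 8)               # (seq_len, batch, embed_dim), no positional encoding
perm = torch.tensor([4, 2, 0, 3, 1])   # an arbitrary reordering of the 5 positions

with torch.no_grad():
    out, _ = attn(x, x, x)                               # original order
    out_shuffled, _ = attn(x[perm], x[perm], x[perm])    # shuffled order

# Shuffling the input only shuffles the output rows in the same way
same = torch.allclose(out[perm], out_shuffled, atol=1e-5)
print(same)
```

Adding a position-dependent vector to each embedding, as the sample model does with its sine and cosine encoding, makes the same word look different at different positions and so breaks this symmetry.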

Summary

Transformers use self-attention to understand all words in a sentence together.

They have encoder and decoder parts for processing input and generating output.

They perform well on tasks such as translation, summarization, and text generation.

Practice

1. What is the main purpose of the self-attention mechanism in a Transformer model? (easy)

A. To increase the number of layers in the model

B. To reduce the size of the input data

C. To convert words into numbers

D. To let the model focus on different words in the sentence at the same time


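Solution

Step 1: Understand self-attention role. Self-attention lets the model weigh every word in the sentence when it processes each word.

Step 2: Match purpose with options. Only option D describes attending to several words of the sentence at once.

Final Answer: D. To let the model focus on different words in the sentence at the same time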
