An encoder-decoder model with attention lets the decoder focus on the most relevant parts of the input at each prediction step. In tasks such as machine translation, the decoder attends to the source words that matter for the word it is currently generating.
Encoder-decoder with attention in NLP
class Encoder(nn.Module):
    def __init__(self, ...): ...
    def forward(self, x): ...

class Attention(nn.Module):
    def __init__(self, ...): ...
    def forward(self, encoder_outputs, decoder_hidden): ...

class Decoder(nn.Module):
    def __init__(self, ...): ...
    def forward(self, input, hidden, encoder_outputs):
        attention_weights = self.attention(encoder_outputs, hidden)
        context = attention_weights @ encoder_outputs
        ...
        return output, hidden, attention_weights
The encoder processes the input sequence into a set of outputs.
The attention layer calculates weights to focus on parts of encoder outputs.
attention_weights = torch.softmax(torch.bmm(decoder_hidden.unsqueeze(1), encoder_outputs.transpose(1,2)), dim=-1)
context = torch.bmm(attention_weights, encoder_outputs)
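The two lines above can be traced by hand for a single example. The following is a minimal sketch in plain Python (no PyTorch), so that each arithmetic step is visible. The function names `softmax` and `dot_product_attention` and the input values are assumptions made for illustration, not part of the original code.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(decoder_hidden, encoder_outputs):
    """Return (weights, context) for one decoder state.

    decoder_hidden:  list of hid_dim floats
    encoder_outputs: list of src_len vectors, each of hid_dim floats
    """
    # One score per source position: dot product with the decoder state.
    scores = [sum(h * e for h, e in zip(decoder_hidden, enc))
              for enc in encoder_outputs]
    weights = softmax(scores)
    # Context vector: weighted average of the encoder outputs.
    hid_dim = len(decoder_hidden)
    context = [sum(w * enc[d] for w, enc in zip(weights, encoder_outputs))
               for d in range(hid_dim)]
    return weights, context

# Illustrative values only: three source positions, hidden size 2.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_hidden = [2.0, 0.0]
weights, context = dot_product_attention(decoder_hidden, encoder_outputs)
print(weights)
print(context)
```

Positions 0 and 2 have the same dot product with the decoder state, so they receive equal weight, and position 1 receives less.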
This code builds a simple encoder-decoder model with attention for sequence tasks. It runs one training step on toy data with teacher forcing and prints the total loss.
import torch
import torch.nn as nn
import torch.optim as optim

# Simple Encoder
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        embedded = self.embedding(src)         # (batch, src_len, emb_dim)
        outputs, hidden = self.rnn(embedded)   # (batch, src_len, hid_dim), (1, batch, hid_dim)
        return outputs, hidden

# Attention Layer
class Attention(nn.Module):
    def __init__(self, hid_dim):
        super().__init__()
        self.attn = nn.Linear(hid_dim * 2, hid_dim)
        self.v = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        src_len = encoder_outputs.shape[1]
        hidden = hidden.permute(1, 0, 2)       # (batch, 1, hid_dim)
        hidden = hidden.repeat(1, src_len, 1)  # (batch, src_len, hid_dim)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        attention = self.v(energy).squeeze(2)  # (batch, src_len)
        return torch.softmax(attention, dim=1)

# Decoder with Attention
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, attention):
        super().__init__()
        self.output_dim = output_dim
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU(hid_dim + emb_dim, hid_dim, batch_first=True)
        self.fc_out = nn.Linear(hid_dim * 2 + emb_dim, output_dim)
        self.attention = attention

    def forward(self, input, hidden, encoder_outputs):
        input = input.unsqueeze(1)                              # (batch, 1)
        embedded = self.embedding(input)                        # (batch, 1, emb_dim)
        attn_weights = self.attention(hidden, encoder_outputs)  # (batch, src_len)
        attn_weights = attn_weights.unsqueeze(1)                # (batch, 1, src_len)
        context = torch.bmm(attn_weights, encoder_outputs)      # (batch, 1, hid_dim)
        rnn_input = torch.cat((embedded, context), dim=2)       # (batch, 1, emb_dim + hid_dim)
        output, hidden = self.rnn(rnn_input, hidden)            # output: (batch, 1, hid_dim)
        output = output.squeeze(1)                              # (batch, hid_dim)
        context = context.squeeze(1)                            # (batch, hid_dim)
        embedded = embedded.squeeze(1)                          # (batch, emb_dim)
        pred_input = torch.cat((output, context, embedded), dim=1)  # (batch, hid_dim*2 + emb_dim)
        prediction = self.fc_out(pred_input)                    # (batch, output_dim)
        return prediction, hidden, attn_weights.squeeze(1)

# Toy data and training loop
INPUT_DIM = 10
OUTPUT_DIM = 10
EMB_DIM = 8
HID_DIM = 16

encoder = Encoder(INPUT_DIM, EMB_DIM, HID_DIM)
attention = Attention(HID_DIM)
decoder = Decoder(OUTPUT_DIM, EMB_DIM, HID_DIM, attention)

criterion = nn.CrossEntropyLoss()
# decoder.parameters() already includes the attention layer's parameters
optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

# Example input: batch size 2, sequence length 5
src = torch.tensor([[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]])
tgt = torch.tensor([[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]])

optimizer.zero_grad()
encoder_outputs, hidden = encoder(src)
input_decoder = tgt[:, 0]  # first token
loss_total = 0
for t in range(1, tgt.shape[1]):
    output, hidden, attn_weights = decoder(input_decoder, hidden, encoder_outputs)
    loss_total = loss_total + criterion(output, tgt[:, t])  # keep the graph for backward()
    input_decoder = tgt[:, t]  # teacher forcing

loss_total.backward()  # backpropagate through all decoding steps
optimizer.step()       # one parameter update
print(f"Total loss: {loss_total.item():.4f}")
Attention helps the decoder look at different parts of the input for each output word.
Teacher forcing means using the true previous word as input during training.
Every sequence in a batch must have the same length (pad shorter ones), and the source and target tensors must share the same batch size.
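To make the teacher forcing note concrete, here is a small sketch that records which token the decoder is fed at each step. It is an illustration under stated assumptions: `toy_decoder_step` is a hypothetical stand-in for a real decoder, and `teacher_forcing_ratio` is an assumed parameter for mixing the two modes (the code in this lesson always feeds the true token, which corresponds to a ratio of 1.0).

```python
import random

def toy_decoder_step(prev_token, vocab_size):
    """Hypothetical stand-in for a decoder step: returns a predicted token id.

    A real decoder would use the hidden state and attention context; this
    stub simply guesses "previous token + 1", wrapped to the vocabulary size.
    """
    return (prev_token + 1) % vocab_size

def decode(target, vocab_size, teacher_forcing_ratio, rng):
    """Run the decoder over `target` and return the inputs it was fed."""
    inputs_fed = []
    prev = target[0]  # start from the first target token
    for t in range(1, len(target)):
        inputs_fed.append(prev)
        predicted = toy_decoder_step(prev, vocab_size)
        use_truth = rng.random() < teacher_forcing_ratio
        # Teacher forcing feeds the true token; otherwise feed the prediction.
        prev = target[t] if use_truth else predicted
    return inputs_fed

target = [1, 5, 3, 4, 2]
rng = random.Random(0)
forced = decode(target, vocab_size=10, teacher_forcing_ratio=1.0, rng=rng)
free = decode(target, vocab_size=10, teacher_forcing_ratio=0.0, rng=rng)
print(forced)  # inputs follow the true target sequence
print(free)    # inputs follow the stub's own predictions
```

With a ratio of 1.0 the decoder always sees the ground-truth previous token, so an early wrong prediction cannot derail later steps during training.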
Encoder-decoder with attention improves sequence tasks by focusing on important input parts.
Attention weights show where the model looks when predicting each output.
This method is widely used in translation, summarization, and more.
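Because each row of attention weights is a distribution over input positions, it can be read as a rough alignment between output and input tokens. The sketch below shows one way to do that. The helper `most_attended`, the tokens, and the weight values are all made up for illustration and are not the output of a trained model.

```python
def most_attended(attn_weights, src_tokens):
    """For each output step, return the source token with the largest weight."""
    aligned = []
    for row in attn_weights:
        best = max(range(len(row)), key=lambda i: row[i])
        aligned.append(src_tokens[best])
    return aligned

# Made-up weights for illustration; each row is one output step and sums to 1.
src_tokens = ["le", "chat", "noir"]
tgt_tokens = ["the", "black", "cat"]
attn_weights = [
    [0.80, 0.15, 0.05],
    [0.10, 0.20, 0.70],
    [0.05, 0.85, 0.10],
]
for tgt, src in zip(tgt_tokens, most_attended(attn_weights, src_tokens)):
    print(f"{tgt} <- {src}")
```

In practice the same matrix is often drawn as a heatmap, with output tokens on one axis and input tokens on the other.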
Practice
Solution
Step 1: Understand the role of attention in sequence models
Attention helps the decoder look at specific parts of the input sequence instead of the whole input equally.
Step 2: Identify the correct purpose
The attention mechanism focuses on relevant input parts to improve output quality.
Final Answer:
To help the model focus on relevant parts of the input sequence when generating each output token -> Option B
Quick Check:
Attention = Focus on input parts [OK]
- Thinking attention reduces input size
- Believing attention speeds training by skipping tokens
- Assuming attention randomly selects tokens
Solution
Step 1: Recall attention weight calculation
Attention weights are usually computed by taking the dot product between the decoder's current hidden state and each encoder output, then applying softmax to turn the scores into probabilities.
Step 2: Match the correct formula
The option "Apply softmax to the dot product of decoder hidden state and encoder outputs" describes exactly this process.
Final Answer:
Apply softmax to the dot product of decoder hidden state and encoder outputs -> Option A
Quick Check:
Attention weights = softmax(dot product) [OK]
- Skipping softmax normalization
- Adding outputs without weighting
- Using random matrices instead of encoder states
What is the shape of attention_weights?
import torch

encoder_outputs = torch.randn(5, 10, 20)  # batch=5, seq_len=10, hidden=20
decoder_hidden = torch.randn(5, 20)       # batch=5, hidden=20

# Compute scores
scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)

# Apply softmax
attention_weights = torch.softmax(scores, dim=1)
Solution
Step 1: Analyze tensor shapes in batch matrix multiplication
encoder_outputs shape is (5, 10, 20), decoder_hidden.unsqueeze(2) shape is (5, 20, 1). The batch matrix multiplication results in shape (5, 10, 1).
Step 2: Remove last dimension and apply softmax
After squeezing, scores shape is (5, 10). Applying softmax along dim=1 keeps shape (5, 10).
Final Answer:
[5, 10] -> Option A
Quick Check:
Attention weights shape = (batch, seq_len) = [5, 10] [OK]
- Confusing hidden size with sequence length
- Forgetting to squeeze last dimension
- Applying softmax on wrong axis
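The shape reasoning in this solution can be checked without PyTorch. The sketch below rebuilds the same computation with nested lists: `batched_scores` plays the role of `torch.bmm(...).squeeze(2)`, and `shape` is a hypothetical helper that reports the dimensions of a regular nested list. Both names are assumptions introduced for this example.

```python
import random

def batched_scores(encoder_outputs, decoder_hidden):
    """Dot each encoder output with its example's decoder state.

    encoder_outputs: nested lists shaped (batch, seq_len, hidden)
    decoder_hidden:  nested lists shaped (batch, hidden)
    returns:         nested lists shaped (batch, seq_len)
    """
    return [
        [sum(e * h for e, h in zip(enc_vec, hid)) for enc_vec in enc_seq]
        for enc_seq, hid in zip(encoder_outputs, decoder_hidden)
    ]

def shape(nested):
    """Return the dimensions of a regular (non-ragged) nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims

rng = random.Random(0)
batch, seq_len, hidden = 5, 10, 20
encoder_outputs = [[[rng.gauss(0, 1) for _ in range(hidden)]
                    for _ in range(seq_len)] for _ in range(batch)]
decoder_hidden = [[rng.gauss(0, 1) for _ in range(hidden)]
                  for _ in range(batch)]

scores = batched_scores(encoder_outputs, decoder_hidden)
print(shape(encoder_outputs), shape(decoder_hidden), shape(scores))
```

The hidden dimension is summed away by the dot product, which is why only the batch and sequence-length dimensions remain. Softmax along the sequence axis then normalizes each row without changing the shape.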
Solution
Step 1: Understand uniform attention weights meaning
If attention weights are uniform, the model treats all input tokens equally without focusing on any part.
Step 2: Identify missing softmax effect
Without softmax, the raw scores are never normalized into a probability distribution over the input tokens, so the resulting values are not valid attention weights.
Final Answer:
The softmax function is missing after computing attention scores -> Option D
Quick Check:
Missing softmax = uniform attention weights [OK]
- Ignoring normalization step
- Blaming encoder size or batch size
- Assuming model depth causes uniform weights
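When debugging attention, it can help to compare raw scores with their softmax output directly. The sketch below uses illustrative score values (an assumption, not taken from any model) to show what normalization changes, and also the case in which softmax itself returns uniform weights: when all scores are equal.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [2.0, -1.0, 0.5]  # illustrative values
weights = softmax(raw_scores)

# Raw scores can be negative and need not sum to 1.
print(min(raw_scores) < 0, sum(raw_scores))
# Softmax output is strictly positive and sums to 1.
print(all(w > 0 for w in weights), sum(weights))

# Equal scores are the case that yields exactly uniform weights.
uniform = softmax([0.3, 0.3, 0.3])
print(uniform)
```

A quick sanity check in a real model is therefore to confirm that each row of weights sums to 1 and that the scores feeding softmax actually differ across positions.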
Solution
Step 1: Identify challenges with long sentences
Long sentences require the model to focus on multiple relevant parts; a single attention head may miss some details.
Step 2: Understand multi-head attention benefits
Multi-head attention allows the model to attend to different parts of the input in parallel, improving context understanding.
Final Answer:
Use multi-head attention to capture different aspects of the input simultaneously -> Option C
Quick Check:
Multi-head attention = better long sentence handling [OK]
- Thinking smaller hidden size helps accuracy
- Removing attention reduces model power
- Assuming batch size alone fixes long sentence issues
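The idea of several heads attending to different positions at once can be sketched in plain Python. This is a deliberately simplified illustration: each head is just a contiguous slice of the vectors and scores are scaled by the square root of the head size, whereas real multi-head attention applies learned linear projections before splitting. The function names and input values are assumptions made for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def split_heads(vector, num_heads):
    """Split a vector into num_heads equal contiguous slices."""
    if len(vector) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    head_dim = len(vector) // num_heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]

def multi_head_weights(query, keys, num_heads):
    """Return one attention distribution over the keys per head.

    Simplification: heads are plain slices of the input vectors, with no
    learned projections.
    """
    query_heads = split_heads(query, num_heads)
    key_heads = [split_heads(k, num_heads) for k in keys]
    head_dim = len(query) // num_heads
    all_weights = []
    for h in range(num_heads):
        scores = [
            sum(q * k for q, k in zip(query_heads[h], key[h])) / math.sqrt(head_dim)
            for key in key_heads
        ]
        all_weights.append(softmax(scores))
    return all_weights

# Illustrative values: hidden size 4, two heads, three source positions.
query = [1.0, 0.0, 0.0, 1.0]
keys = [
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
    [0.0, 1.0, 1.0, 0.0],
]
head_weights = multi_head_weights(query, keys, num_heads=2)
for h, weights in enumerate(head_weights):
    print(h, [round(w, 3) for w in weights])
```

With these inputs the first head puts most of its weight on position 0 and the second head on position 1, which is the behavior the solution describes: different heads can pick out different parts of the same input.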
