
Encoder-decoder with attention in NLP - ML Experiment: Train & Evaluate

Experiment - Encoder-decoder with attention
Problem: We want to build a simple sequence-to-sequence model that translates short English sentences to French. The current model uses an encoder-decoder architecture without attention.
Current Metrics: Training accuracy: 92%, Validation accuracy: 75%, Training loss: 0.25, Validation loss: 0.60
Issue: The model overfits: training accuracy is high but validation accuracy is much lower. It struggles to generalize to new sentences.
Your Task
Add an attention mechanism to the encoder-decoder model to improve validation accuracy to above 85% while keeping training accuracy below 90%.
Keep the same dataset and preprocessing.
Do not increase the model size drastically (keep similar number of parameters).
Train for a maximum of 20 epochs.
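The task constraints can be written down as a small check, which makes it easy to see why the baseline fails. This is only a sketch: the helper names `meets_target` and `generalization_gap` are hypothetical, and the baseline epoch count is an assumption because the page does not state it.

```python
def generalization_gap(train_acc, val_acc):
    """Difference between training and validation accuracy (larger = more overfitting)."""
    return train_acc - val_acc


def meets_target(train_acc, val_acc, epochs, max_epochs=20):
    """True when validation accuracy > 85%, training accuracy < 90%, and epochs <= max_epochs."""
    return val_acc > 0.85 and train_acc < 0.90 and epochs <= max_epochs


# Baseline metrics from the problem statement (epochs=20 is an assumed value).
baseline_gap = generalization_gap(0.92, 0.75)
baseline_ok = meets_target(0.92, 0.75, epochs=20)

print(f"baseline gap: {baseline_gap:.2f}, meets target: {baseline_ok}")
```

A 17-point gap between training and validation accuracy is the overfitting signal the task asks you to reduce.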
Solution
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Layer
from tensorflow.keras.models import Model
import numpy as np

# Sample data preparation (toy example)
input_texts = ['hello', 'how are you', 'good morning', 'thank you']
target_texts = ['bonjour', 'comment ça va', 'bon matin', 'merci']

# Tokenization and vectorization (simplified for example)
input_characters = sorted(list(set(''.join(input_texts))))
target_characters = sorted(list(set(''.join(target_texts))))
num_encoder_tokens = len(input_characters) + 1
num_decoder_tokens = len(target_characters) + 1
max_encoder_seq_length = max(len(txt) for txt in input_texts)
max_decoder_seq_length = max(len(txt) for txt in target_texts) + 1

input_token_index = dict([(char, i + 1) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i + 1) for i, char in enumerate(target_characters)])

encoder_input_data = np.zeros((len(input_texts), max_encoder_seq_length), dtype='int32')
decoder_input_data = np.zeros((len(input_texts), max_decoder_seq_length), dtype='int32')
decoder_target_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t] = input_token_index[char]
    for t, char in enumerate(target_text):
        decoder_input_data[i, t] = target_token_index[char]
        if t > 0:
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0

# Define Bahdanau Attention Layer
class BahdanauAttention(Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = Dense(units)
        self.W2 = Dense(units)
        self.V = Dense(1)

    def call(self, query, values):
        # query shape: (batch_size, hidden size)
        # values shape: (batch_size, max_len, hidden size)
        query_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(query_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights

# Model parameters
embedding_dim = 64
units = 64

# Encoder
encoder_inputs = Input(shape=(None,), name='encoder_inputs')
encoder_embedding = Embedding(num_encoder_tokens, embedding_dim, mask_zero=True)(encoder_inputs)
encoder_lstm = LSTM(units, return_sequences=True, return_state=True, dropout=0.3)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)

# Decoder
decoder_inputs = Input(shape=(None,), name='decoder_inputs')
decoder_embedding = Embedding(num_decoder_tokens, embedding_dim, mask_zero=True)(decoder_inputs)

attention = BahdanauAttention(units)

# Prepare decoder LSTM
decoder_lstm = LSTM(units, return_sequences=True, return_state=True, dropout=0.3)
dense = Dense(num_decoder_tokens, activation='softmax')

all_outputs = []
inputs = decoder_embedding

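# Use teacher forcing for training: unroll the decoder one time step at a time
# Note: applying tf ops directly to symbolic Keras tensors, as done below, may
# require the Keras 2 functional API that ships with older TF 2.x releases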
for t in range(max_decoder_seq_length):
    # Get context vector from attention
    context_vector, attn_weights = attention(state_h, encoder_outputs)
    # Expand dims to concatenate
    context_vector = tf.expand_dims(context_vector, 1)
    # Concatenate context vector and decoder input at time t
    x = tf.concat([context_vector, inputs[:, t:t+1, :]], axis=-1)
    # Pass through LSTM
    output, state_h, state_c = decoder_lstm(x, initial_state=[state_h, state_c])
    # Output dense layer
    output = dense(output)
    all_outputs.append(output)

# Concatenate all time steps
decoder_outputs = tf.concat(all_outputs, axis=1)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train model
model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=2, epochs=20, validation_split=0.2, verbose=2)
Added a Bahdanau attention layer so the decoder can focus on the relevant encoder outputs at each step.
Modified the decoder to concatenate the attention context vector with the decoder input embedding.
Added dropout to the LSTM layers to reduce overfitting.
Kept the embedding and LSTM sizes moderate to avoid a large increase in model size.
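To see what the attention layer computes at a single decoding step, here is a minimal NumPy sketch of additive (Bahdanau-style) scoring. It is not the Keras layer itself: the weight matrices `W1`, `W2`, and `v` are random stand-ins for the three Dense layers, and the Dense biases are left out for brevity.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def additive_attention(query, values, W1, W2, v):
    """query: (batch, hidden); values: (batch, src_len, hidden)."""
    q = query[:, None, :]                        # (batch, 1, hidden)
    score = np.tanh(values @ W1 + q @ W2) @ v    # (batch, src_len, 1)
    weights = softmax(score, axis=1)             # normalize over source positions
    context = (weights * values).sum(axis=1)     # (batch, hidden)
    return context, weights


# Random stand-in tensors (assumed sizes, chosen only for illustration).
rng = np.random.default_rng(0)
batch, src_len, hidden, units = 2, 5, 8, 4
values = rng.normal(size=(batch, src_len, hidden))   # encoder outputs
query = rng.normal(size=(batch, hidden))             # decoder hidden state
W1 = rng.normal(size=(hidden, units))
W2 = rng.normal(size=(hidden, units))
v = rng.normal(size=(units, 1))

context, weights = additive_attention(query, values, W1, W2, v)
print(context.shape, weights.shape)
```

The weights sum to 1 over the source positions for each example, so the context vector is a weighted average of the encoder outputs.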
Results Interpretation

Before: Training accuracy 92%, Validation accuracy 75%, Validation loss 0.60

After: Training accuracy 88%, Validation accuracy 87%, Validation loss 0.40

Adding attention helps the model focus on important parts of the input sequence, improving generalization and reducing overfitting. Dropout also helps by preventing the model from memorizing training data.
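The dropout idea mentioned above can be sketched in a few lines of NumPy. This shows the general "inverted dropout" technique under assumed names (`inverted_dropout` is hypothetical). It is not a reproduction of how the Keras LSTM layer applies its `dropout` argument internally.

```python
import numpy as np


def inverted_dropout(x, rate, rng, training=True):
    """Zero each element with probability `rate` and rescale the survivors.

    Rescaling by 1 / (1 - rate) keeps the expected activation unchanged,
    so no extra scaling is needed at inference time.
    """
    if not training or rate == 0.0:
        return x
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep
    return x * mask / keep


rng = np.random.default_rng(0)
x = np.ones((4, 6))
out_train = inverted_dropout(x, 0.3, rng)                  # some units zeroed
out_eval = inverted_dropout(x, 0.3, rng, training=False)   # unchanged
```

Because a different random subset of units is dropped on each training step, the network cannot rely on any single unit to memorize a training example.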
Bonus Experiment
Try replacing Bahdanau attention with Luong attention and compare the validation accuracy.
Hint
Luong attention computes alignment scores differently and may perform better or worse depending on data. Implement it by changing the attention scoring function.
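As a starting point for the bonus experiment, the sketch below shows two multiplicative (Luong-style) scoring functions in NumPy. The function names are hypothetical, and `W` is a random stand-in for a learned weight matrix. To use this in the Keras solution you would swap the scoring line inside the attention layer's `call` method.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def luong_dot_score(query, values):
    """score(h_t, h_s) = h_t . h_s  ->  (batch, src_len)."""
    return (values @ query[:, :, None])[:, :, 0]


def luong_general_score(query, values, W):
    """score(h_t, h_s) = h_t^T W h_s  ->  (batch, src_len)."""
    return ((values @ W.T) @ query[:, :, None])[:, :, 0]


def attend(scores, values):
    """Turn scores into weights and a context vector."""
    weights = softmax(scores, axis=1)                      # (batch, src_len)
    context = (weights[:, :, None] * values).sum(axis=1)   # (batch, hidden)
    return context, weights


rng = np.random.default_rng(1)
values = rng.normal(size=(2, 5, 8))   # encoder outputs
query = rng.normal(size=(2, 8))       # decoder hidden state
W = rng.normal(size=(8, 8))           # stand-in for a learned matrix

dot_context, dot_weights = attend(luong_dot_score(query, values), values)
gen_context, gen_weights = attend(luong_general_score(query, values, W), values)
```

The dot variant adds no parameters, which fits the constraint of keeping the model size similar. It does require the encoder and decoder hidden sizes to match.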

Practice

1. What is the main purpose of the attention mechanism in an encoder-decoder model?
easy
A. To randomly select input tokens for the decoder
B. To help the model focus on relevant parts of the input sequence when generating each output token
C. To speed up the training by skipping some input tokens
D. To reduce the size of the input data before encoding

Solution

  1. Step 1: Understand the role of attention in sequence models

    Attention helps the decoder look at specific parts of the input sequence instead of the whole input equally.
  2. Step 2: Identify the correct purpose

    The attention mechanism focuses on relevant input parts to improve output quality.
  3. Final Answer:

    To help the model focus on relevant parts of the input sequence when generating each output token -> Option B
  4. Quick Check:

    Attention = Focus on input parts [OK]
Hint: Attention means focusing on important input parts [OK]
Common Mistakes:
  • Thinking attention reduces input size
  • Believing attention speeds training by skipping tokens
  • Assuming attention randomly selects tokens
2. Which of the following is the correct way to compute the attention weights in an encoder-decoder model?
easy
A. Apply softmax to the dot product of decoder hidden state and encoder outputs
B. Add encoder outputs and decoder outputs directly without normalization
C. Multiply decoder output by a random matrix
D. Use the maximum value of encoder outputs as attention weight

Solution

  1. Step 1: Recall attention weight calculation

    In dot-product attention, the weights are computed by taking the dot product between the decoder's current hidden state and each encoder output, then applying softmax to turn the scores into a probability distribution. (Additive, Bahdanau-style attention replaces the dot product with a small feed-forward scoring network but still ends with softmax.)
  2. Step 2: Match the correct formula

    Option A describes exactly this process: softmax applied to the dot-product scores.
  3. Final Answer:

    Apply softmax to the dot product of decoder hidden state and encoder outputs -> Option A
  4. Quick Check:

    Attention weights = softmax(dot product) [OK]
Hint: Attention weights come from softmax of dot products [OK]
Common Mistakes:
  • Skipping softmax normalization
  • Adding outputs without weighting
  • Using random matrices instead of encoder states
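A tiny worked example makes the softmax-of-dot-products recipe concrete. The numbers below are made up for illustration: three encoder outputs of size 2 and one decoder hidden state.

```python
import numpy as np

# Three encoder outputs (src_len=3, hidden=2) and one decoder hidden state.
encoder_outputs = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
decoder_hidden = np.array([2.0, 1.0])

# Step 1: one dot-product score per source position -> [2., 1., 3.]
scores = encoder_outputs @ decoder_hidden

# Step 2: softmax turns the scores into weights that sum to 1
exp_scores = np.exp(scores - scores.max())
attention_weights = exp_scores / exp_scores.sum()   # approximately [0.245, 0.090, 0.665]

# Step 3: the context vector is the weighted sum of encoder outputs
context_vector = attention_weights @ encoder_outputs

print(scores, attention_weights.round(3), context_vector.round(3))
```

The third encoder output has the largest dot product with the decoder state, so it receives the largest weight.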
3. Given the following simplified code snippet for attention weights calculation, what will be the output shape of attention_weights?
import torch

encoder_outputs = torch.randn(5, 10, 20)  # batch=5, seq_len=10, hidden=20
decoder_hidden = torch.randn(5, 20)       # batch=5, hidden=20

# Compute scores: (5, 10, 20) x (5, 20, 1) -> (5, 10, 1) -> (5, 10)
scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
# Apply softmax over the sequence dimension
attention_weights = torch.softmax(scores, dim=1)
medium
A. [5, 10]
B. [5, 20]
C. [10, 20]
D. [5, 1]

Solution

  1. Step 1: Analyze tensor shapes in batch matrix multiplication

    encoder_outputs shape is (5, 10, 20), decoder_hidden.unsqueeze(2) shape is (5, 20, 1). The batch matrix multiplication results in shape (5, 10, 1).
  2. Step 2: Remove last dimension and apply softmax

    After squeezing, scores shape is (5, 10). Applying softmax along dim=1 keeps shape (5, 10).
  3. Final Answer:

    [5, 10] -> Option A
  4. Quick Check:

    Attention weights shape = (batch, seq_len) = [5, 10] [OK]
Hint: Attention weights shape = batch size x input sequence length [OK]
Common Mistakes:
  • Confusing hidden size with sequence length
  • Forgetting to squeeze last dimension
  • Applying softmax on wrong axis
4. You implemented an encoder-decoder with attention model but notice the attention weights are always uniform (equal values). What is the most likely cause?
medium
A. The batch size is too small
B. The encoder outputs have different dimensions than decoder hidden states
C. The model uses too many layers in the encoder
D. The softmax function is missing after computing attention scores

Solution

  1. Step 1: Understand uniform attention weights meaning

    If attention weights are uniform, the model treats all input tokens equally without focusing on any part.
  2. Step 2: Identify missing softmax effect

    Without a correctly applied softmax, the raw scores are never normalized into a probability distribution over input positions, so the resulting weights come out uniform or otherwise incorrect.
  3. Final Answer:

    The softmax function is missing after computing attention scores -> Option D
  4. Quick Check:

    Missing softmax = uniform attention weights [OK]
Hint: Always apply softmax to attention scores [OK]
Common Mistakes:
  • Ignoring normalization step
  • Blaming encoder size or batch size
  • Assuming model depth causes uniform weights
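When debugging attention weights, it helps to check the normalization step directly. The NumPy sketch below uses made-up scores to compare three cases: no softmax, softmax over the sequence axis, and softmax over the wrong axis. The last case is a related bug, and it is one way to end up with identical weights for every position.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


# Scores shaped like the Bahdanau layer's output: (batch=1, src_len=3, 1).
scores = np.array([[[2.0], [0.5], [-1.0]]])

raw_total = scores.sum(axis=1)            # raw scores are not a distribution
good_weights = softmax(scores, axis=1)    # normalized over source positions
bad_weights = softmax(scores, axis=-1)    # wrong axis: length-1, so every weight is 1.0

print(raw_total.ravel(), good_weights.ravel().round(3), bad_weights.ravel())
```

A quick sanity check during development is to assert that the weights sum to 1 along the source-sequence axis and are not all equal.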
5. In a machine translation task using an encoder-decoder with attention, the model struggles to translate long sentences accurately. Which modification can best help improve performance?
hard
A. Remove the attention mechanism to simplify the model
B. Reduce the encoder hidden size to speed up training
C. Use multi-head attention to capture different aspects of the input simultaneously
D. Increase the batch size without changing the model

Solution

  1. Step 1: Identify challenges with long sentences

    Long sentences require the model to focus on multiple relevant parts of the input; a single attention head may miss some of them.
  2. Step 2: Understand multi-head attention benefits

    Multi-head attention allows the model to attend to different parts of the input in parallel, improving context understanding.
  3. Final Answer:

    Use multi-head attention to capture different aspects of the input simultaneously -> Option C
  4. Quick Check:

    Multi-head attention = better long sentence handling [OK]
Hint: Multi-head attention improves focus on complex inputs [OK]
Common Mistakes:
  • Thinking smaller hidden size helps accuracy
  • Removing attention to simplify the model
  • Assuming batch size alone fixes long sentence issues
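The sketch below illustrates the core of multi-head attention in NumPy: the hidden dimension is split into heads, each head runs scaled dot-product attention independently, and the results are merged. It is a simplified illustration: the learned query, key, value, and output projections used in practice are omitted, and all tensors are random stand-ins.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(query, keys, values, num_heads):
    """query: (batch, tgt_len, d_model); keys and values: (batch, src_len, d_model)."""
    batch, tgt_len, d_model = query.shape
    src_len = keys.shape[1]
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split(x, length):
        # (batch, length, d_model) -> (batch, heads, length, d_head)
        return x.reshape(batch, length, num_heads, d_head).transpose(0, 2, 1, 3)

    q, k, v = split(query, tgt_len), split(keys, src_len), split(values, src_len)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)   # (batch, heads, tgt, src)
    weights = softmax(scores, axis=-1)                        # one distribution per head
    heads = weights @ v                                       # (batch, heads, tgt, d_head)
    merged = heads.transpose(0, 2, 1, 3).reshape(batch, tgt_len, d_model)
    return merged, weights


rng = np.random.default_rng(2)
decoder_states = rng.normal(size=(2, 3, 8))    # batch=2, tgt_len=3, d_model=8
encoder_outputs = rng.normal(size=(2, 6, 8))   # batch=2, src_len=6, d_model=8

output, head_weights = multi_head_attention(
    decoder_states, encoder_outputs, encoder_outputs, num_heads=4
)
print(output.shape, head_weights.shape)
```

Each head produces its own distribution over source positions, so different heads can attend to different parts of a long sentence at the same decoding step.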