Sequence-to-sequence architecture in NLP - ML Experiment: Train & Evaluate

Experiment - Sequence-to-sequence architecture
Problem: We want to build a model that can translate simple English sentences to French using a sequence-to-sequence architecture.
Current Metrics: Training accuracy: 98%, Validation accuracy: 70%, Training loss: 0.05, Validation loss: 0.45
Issue: The model is overfitting: training accuracy is very high but validation accuracy is much lower, indicating poor generalization.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.
You can only modify the model architecture and training hyperparameters.
Do not change the dataset or preprocessing steps.
Solution
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Sample data loading and preprocessing assumed here
# For demonstration, we use dummy data shapes
num_encoder_tokens = 100
num_decoder_tokens = 100
max_encoder_seq_length = 10
max_decoder_seq_length = 10

# Define encoder
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_lstm = LSTM(64, return_state=True, dropout=0.3, recurrent_dropout=0.3)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

# Define decoder
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(64, return_sequences=True, return_state=True, dropout=0.3, recurrent_dropout=0.3)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

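# Dummy one-hot data for demonstration (random token ids, so accuracy stays near chance)
rng = np.random.default_rng(0)
enc_ids = rng.integers(0, num_encoder_tokens, size=(1000, max_encoder_seq_length))
dec_ids = rng.integers(0, num_decoder_tokens, size=(1000, max_decoder_seq_length + 1))
X_encoder = np.eye(num_encoder_tokens, dtype="float32")[enc_ids]
# Teacher forcing: the decoder input is the target sequence shifted by one step
X_decoder = np.eye(num_decoder_tokens, dtype="float32")[dec_ids[:, :-1]]
y = np.eye(num_decoder_tokens, dtype="float32")[dec_ids[:, 1:]]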

# Train model
history = model.fit(
    [X_encoder, X_decoder], y,
    batch_size=64,
    epochs=30,
    validation_split=0.2,
    callbacks=[early_stopping]
)
Reduced LSTM units from 256 to 64 to simplify the model.
Added dropout and recurrent dropout of 0.3 to both encoder and decoder LSTM layers to reduce overfitting.
Lowered the learning rate to 0.001 (also the Keras default for Adam) for smoother training.
Added early stopping to stop training when validation loss stops improving.
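The early-stopping rule in the last item can be sketched in plain Python. This is a simplified illustration of patience-based stopping, not the Keras implementation; the function name `early_stop_epoch` and the loss values are made up for the example.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for patience-based early stopping.

    stop_epoch is None when training runs through every epoch.
    Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Hypothetical validation losses, one per epoch
losses = [0.90, 0.60, 0.50, 0.52, 0.55, 0.51, 0.40]
stop_epoch, best_epoch = early_stop_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # prints: 5 2
```

Training stops at epoch 5 because epochs 3, 4, and 5 all fail to beat the loss from epoch 2. With restore_best_weights=True, the weights from that best epoch are the ones kept, so the later dip at epoch 6 is never reached.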
Results Interpretation

Before: Training accuracy was 98% but validation accuracy was only 70%, showing overfitting.

After: Training accuracy dropped to 90%, validation accuracy improved to 87%, and validation loss decreased, indicating better generalization.

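Dropout and a smaller model limit how much of the training set the network can memorize, early stopping halts training once validation loss stops improving, and a moderate learning rate keeps updates stable. Together these changes reduce overfitting and improve validation performance in sequence-to-sequence models.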
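The before/after comparison can be checked with a few lines of arithmetic. The sketch below uses the accuracies reported above and the thresholds from the task statement; the helper names `generalization_gap` and `meets_target` are illustrative only.

```python
def generalization_gap(train_acc, val_acc):
    """Difference between training and validation accuracy."""
    return round(train_acc - val_acc, 4)


def meets_target(train_acc, val_acc, min_val=0.85, max_train=0.92):
    """Task goal: validation accuracy >= 85% with training accuracy below 92%."""
    return val_acc >= min_val and train_acc < max_train


before = {"train": 0.98, "val": 0.70}
after = {"train": 0.90, "val": 0.87}

for name, m in (("before", before), ("after", after)):
    gap = generalization_gap(m["train"], m["val"])
    print(name, gap, meets_target(m["train"], m["val"]))
```

The gap shrinks from 0.28 to 0.03, and only the "after" metrics satisfy both conditions of the task.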
Bonus Experiment
Try using a bidirectional LSTM in the encoder to see if it improves translation accuracy further.
💡 Hint
Replace the encoder LSTM with a Bidirectional wrapper and observe changes in validation accuracy.
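One possible way to set up this bonus experiment is sketched below. It assumes the same token counts as the solution code and uses 32 units per direction, a choice made here so that the concatenated forward and backward states match the 64-unit decoder; other ways of combining the states are possible.

```python
from tensorflow.keras.layers import Input, LSTM, Dense, Bidirectional, Concatenate
from tensorflow.keras.models import Model

num_encoder_tokens = 100
num_decoder_tokens = 100

# Bidirectional encoder: 32 units per direction (assumed for this sketch)
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = Bidirectional(
    LSTM(32, return_state=True, dropout=0.3, recurrent_dropout=0.3)
)
_, fwd_h, fwd_c, bwd_h, bwd_c = encoder(encoder_inputs)

# Concatenate forward and backward states into 64-dimensional vectors
state_h = Concatenate()([fwd_h, bwd_h])
state_c = Concatenate()([fwd_c, bwd_c])

# Decoder is unchanged apart from receiving the concatenated states
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(64, return_sequences=True, return_state=True,
                    dropout=0.3, recurrent_dropout=0.3)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=[state_h, state_c])
decoder_outputs = Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()
```

The model can be compiled and trained with the same compile and fit calls as the solution, which keeps the comparison of validation accuracy fair.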

Practice

1. What is the main role of the encoder in a sequence-to-sequence model?
easy
A. To generate the output sequence directly
B. To read and understand the input sequence
C. To evaluate the model's accuracy
D. To preprocess the data before training

Solution

  1. Step 1: Understand the encoder's function

    The encoder processes the input sequence and converts it into a meaningful representation.
  2. Step 2: Differentiate encoder from decoder

    The decoder uses this representation to generate the output sequence, so it does not directly read input.
  3. Final Answer:

    To read and understand the input sequence -> Option B
  4. Quick Check:

    Encoder = input reader [OK]
Hint: Encoder reads input; decoder writes output [OK]
Common Mistakes:
  • Confusing encoder with decoder
  • Thinking encoder generates output
  • Assuming encoder evaluates accuracy
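To make the encoder's role concrete, here is a minimal NumPy sketch of a plain tanh RNN encoder. The weights are random and untrained, and the dimensions are arbitrary choices for illustration; the point is only that inputs of different lengths are reduced to a representation of the same size.

```python
import numpy as np


def encode(inputs, W_x, W_h, b):
    """Run a simple tanh RNN over inputs of shape (seq_len, input_dim).

    Returns the final hidden state, which summarizes the whole sequence.
    """
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h


rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 4  # arbitrary sizes for the sketch
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

short_seq = rng.normal(size=(5, input_dim))
long_seq = rng.normal(size=(9, input_dim))
short_state = encode(short_seq, W_x, W_h, b)
long_state = encode(long_seq, W_x, W_h, b)
print(short_state.shape, long_state.shape)  # prints: (4,) (4,)
```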
2. Which of the following is the correct way to describe the decoder in a sequence-to-sequence model?
easy
A. It generates the output sequence from the encoded input
B. It encodes the input sequence into a fixed vector
C. It normalizes the input data before encoding
D. It splits the input sequence into smaller parts

Solution

  1. Step 1: Identify decoder's role

    The decoder takes the encoded input and produces the output sequence step-by-step.
  2. Step 2: Eliminate incorrect options

    Encoding is done by the encoder, not the decoder; normalization and splitting are preprocessing steps.
  3. Final Answer:

    It generates the output sequence from the encoded input -> Option A
  4. Quick Check:

    Decoder = output generator [OK]
Hint: Decoder creates output from encoder's info [OK]
Common Mistakes:
  • Mixing encoder and decoder roles
  • Confusing preprocessing with decoding
  • Assuming decoder encodes input
3. Consider this simplified pseudocode for a sequence-to-sequence model:
encoded = encoder(input_sequence)
output = decoder(encoded)
print(len(output))
If the input sequence length is 5 and the model is trained to translate to a sequence of length 7, what will len(output) print?
medium
A. 5
B. Cannot determine without more info
C. 12
D. 7

Solution

  1. Step 1: Understand input and output lengths

    The input sequence length is 5, but the model is trained to produce output sequences of length 7.
  2. Step 2: Recognize decoder output length

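    The decoder generates tokens step by step until it emits an end-of-sequence token or reaches a length limit, so output length does not depend on input length. Since this model is trained to produce a target of length 7, the output has length 7.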
  3. Final Answer:

    7 -> Option D
  4. Quick Check:

    Output length = trained target length = 7 [OK]
Hint: Output length matches target, not input length [OK]
Common Mistakes:
  • Assuming output length equals input length
  • Adding input and output lengths
  • Saying output length is unknown
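A toy decoding loop shows why the output length is set by the decoder rather than by the input. Everything here is a stand-in: `toy_decoder_step` imitates a trained decoder that emits seven tokens and then an end marker, and the token strings are placeholders.

```python
END = "<end>"


def toy_decoder_step(encoded, generated):
    """Hypothetical stand-in for a trained decoder step.

    Emits seven placeholder tokens, then the END marker.
    """
    target = [f"tok{i}" for i in range(1, 8)]
    position = len(generated)
    return target[position] if position < len(target) else END


def greedy_decode(encoded, step_fn, max_len=20):
    """Generate tokens one at a time until END or max_len is reached."""
    generated = []
    while len(generated) < max_len:
        token = step_fn(encoded, generated)
        if token == END:
            break
        generated.append(token)
    return generated


# Placeholder for the encoder output of a 5-token input sequence
encoded = ("in1", "in2", "in3", "in4", "in5")
output = greedy_decode(encoded, toy_decoder_step)
print(len(output))  # prints: 7
```

The loop never looks at the length of the input: it stops when the decoder emits the end marker or when max_len is hit.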
4. You have this code snippet for a sequence-to-sequence model training step:
for input_seq, target_seq in dataset:
    encoded = encoder(input_seq)
    output = decoder(encoded)
    loss = loss_function(output, target_seq)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
What is the likely error in this code?
medium
A. optimizer.zero_grad() should be called before loss.backward()
B. optimizer.step() should be called before loss.backward()
C. Missing call to optimizer.zero_grad() before loss.backward()
D. optimizer.zero_grad() should be called before optimizer.step()

Solution

  1. Step 1: Recall training step order

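    PyTorch accumulates gradients across loss.backward() calls, so they must be cleared once per iteration before new gradients are computed.
  2. Step 2: Identify correct zero_grad() placement

    The snippet clears gradients only after optimizer.step(), so nothing clears them before the first loss.backward(). That order works only if gradients start at zero; the conventional placement is optimizer.zero_grad() at the top of the step, before loss.backward().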
  3. Final Answer:

    Missing call to optimizer.zero_grad() before loss.backward() -> Option C
  4. Quick Check:

    Clear grads before backward pass [OK]
Hint: Call zero_grad() before backward() [OK]
Common Mistakes:
  • Calling zero_grad() after backward()
  • Calling optimizer.step() before backward()
  • Skipping zero_grad() entirely
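The sketch below shows gradient accumulation in PyTorch and a training step in the conventional order. The encoder and decoder are single linear layers standing in for real sequence models, and the data is random; only the ordering of the optimizer calls is the point.

```python
import torch
from torch import nn

# Gradients accumulate across backward() calls until they are cleared.
w = torch.tensor(1.0, requires_grad=True)
(2 * w).backward()
(2 * w).backward()
accumulated = w.grad.item()  # 2 + 2, because nothing cleared the first gradient
print(accumulated)  # prints: 4.0

torch.manual_seed(0)
encoder = nn.Linear(4, 8)  # stand-in modules, not real seq2seq layers
decoder = nn.Linear(8, 2)
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
loss_function = nn.MSELoss()

input_seq = torch.randn(16, 4)   # random placeholder batch
target_seq = torch.randn(16, 2)

losses = []
for _ in range(20):
    optimizer.zero_grad()                     # 1. clear stale gradients
    output = decoder(encoder(input_seq))      # 2. forward pass
    loss = loss_function(output, target_seq)
    loss.backward()                           # 3. compute fresh gradients
    optimizer.step()                          # 4. update parameters
    losses.append(loss.item())
```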
5. In a sequence-to-sequence model for language translation, why might adding an attention mechanism improve performance?
hard
A. It allows the decoder to focus on relevant parts of the input sequence dynamically
B. It reduces the size of the input sequence to a fixed vector
C. It speeds up training by skipping the encoder step
D. It replaces the decoder with a simpler model

Solution

  1. Step 1: Understand attention's purpose

    Attention helps the decoder look at different parts of the input sequence when generating each output token.
  2. Step 2: Compare with fixed vector encoding

    Without attention, the encoder compresses input into one fixed vector, which can lose details.
  3. Step 3: Eliminate incorrect options

    Attention does not reduce input size, skip encoder, or replace decoder; it enhances focus during decoding.
  4. Final Answer:

    It allows the decoder to focus on relevant parts of the input sequence dynamically -> Option A
  5. Quick Check:

    Attention = dynamic focus on input [OK]
Hint: Attention helps decoder focus on input parts [OK]
Common Mistakes:
  • Thinking attention reduces input size
  • Believing attention skips encoder
  • Assuming attention replaces decoder
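A minimal NumPy sketch of dot-product attention, one common variant, illustrates the idea. The encoder states and decoder state are random placeholders and the sizes are arbitrary; a trained model would learn these representations.

```python
import numpy as np


def dot_product_attention(decoder_state, encoder_states):
    """Return attention weights over source positions and the context vector.

    decoder_state:  shape (hidden_dim,)
    encoder_states: shape (src_len, hidden_dim)
    """
    scores = encoder_states @ decoder_state   # one score per source position
    scores = scores - scores.max()            # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax
    context = weights @ encoder_states        # weighted sum of encoder states
    return weights, context


rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))  # 5 source positions, hidden size 4
decoder_state = rng.normal(size=4)
weights, context = dot_product_attention(decoder_state, encoder_states)
print(weights.shape, context.shape)  # prints: (5,) (4,)
```

The weights are recomputed at every decoding step from the current decoder state, which is what lets the decoder focus on different input positions for different output tokens instead of relying on a single fixed vector.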