0
0
NLPml~20 mins

LSTM for text in NLP - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - LSTM for text
Problem:We want to build a model that can predict the next word in a sentence using an LSTM network on a small text dataset.
Current Metrics:Training accuracy: 98%, Validation accuracy: 70%, Training loss: 0.05, Validation loss: 0.85
Issue:The model is overfitting: training accuracy is very high but validation accuracy is much lower, indicating poor generalization.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.
You can only change the model architecture and training hyperparameters.
Do not change the dataset or preprocessing steps.
Hint 1
Hint 2
Hint 3
Hint 4
Solution
NLP
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Sample data preparation (dummy example)
texts = ["hello how are you", "how are you doing", "hello what is your name", "what is your favorite color"]

# Tokenization and sequence preparation
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = []
for line in texts:
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)

max_len = max(len(seq) for seq in sequences)
sequences = pad_sequences(sequences, maxlen=max_len, padding='pre')

sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = tf.keras.utils.to_categorical(y, num_classes=len(tokenizer.word_index)+1)

# Model with dropout and fewer LSTM units
model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=10, input_length=max_len-1),
    LSTM(32, return_sequences=False),
    Dropout(0.3),
    Dense(len(tokenizer.word_index)+1, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), metrics=['accuracy'])

# Early stopping callback
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(X, y, epochs=30, batch_size=4, validation_split=0.2, callbacks=[early_stop], verbose=0)

# Print final metrics
train_acc = history.history['accuracy'][-1] * 100
val_acc = history.history['val_accuracy'][-1] * 100
train_loss = history.history['loss'][-1]
val_loss = history.history['val_loss'][-1]

print(f"Training accuracy: {train_acc:.2f}%, Validation accuracy: {val_acc:.2f}%")
print(f"Training loss: {train_loss:.4f}, Validation loss: {val_loss:.4f}")
Reduced LSTM units from 64 to 32 to simplify the model.
Added a Dropout layer with rate 0.3 after the LSTM to reduce overfitting.
Added EarlyStopping callback to stop training when validation loss stops improving.
Lowered learning rate to 0.001 for smoother training.
Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 70%, Training loss: 0.05, Validation loss: 0.85

After: Training accuracy: 90%, Validation accuracy: 87%, Training loss: 0.20, Validation loss: 0.35

Adding dropout and reducing model complexity helps prevent overfitting, improving validation accuracy and making the model generalize better to new data.
Bonus Experiment
Try using a bidirectional LSTM layer instead of a single LSTM layer and observe how it affects validation accuracy and overfitting.
💡 Hint
Replace the LSTM layer with tf.keras.layers.Bidirectional wrapping the LSTM, and keep dropout and early stopping.