ML Python · ~20 mins

Text classification pipeline in ML Python - ML Experiment: Train & Evaluate

Experiment - Text classification pipeline
Problem: Classify movie reviews as positive or negative using a simple text classification model.
Current Metrics: Training accuracy: 95%, Validation accuracy: 70%, Training loss: 0.15, Validation loss: 0.60
Issue: The model is overfitting: training accuracy is very high, but validation accuracy is much lower.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 90%.
Use the same dataset and model architecture (simple neural network with embedding and dense layers).
Do not increase training epochs beyond 10.
Do not change the dataset size.
Solution
ML Python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping

# Sample data
texts = ["I love this movie", "This movie is terrible", "Amazing film", "Worst movie ever", "I enjoyed it", "Not good", "Fantastic acting", "Bad plot"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Tokenize and pad sequences
max_words = 1000
max_len = 10

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
data = pad_sequences(sequences, maxlen=max_len)

# Convert labels to tensor
labels = tf.convert_to_tensor(labels)

# Build model with dropout
model = Sequential([
    Embedding(max_words, 16, input_length=max_len),
    GlobalAveragePooling1D(),
    Dropout(0.5),
    Dense(16, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Early stopping callback
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)

# Train model with validation split
history = model.fit(data, labels, epochs=10, batch_size=2, validation_split=0.25, callbacks=[early_stop])

# Evaluate final metrics
train_acc = history.history['accuracy'][-1] * 100
val_acc = history.history['val_accuracy'][-1] * 100
train_loss = history.history['loss'][-1]
val_loss = history.history['val_loss'][-1]

print(f"Training accuracy: {train_acc:.2f}%, Validation accuracy: {val_acc:.2f}%")
print(f"Training loss: {train_loss:.3f}, Validation loss: {val_loss:.3f}")
Added Dropout layers after the pooling and hidden dense layers to reduce overfitting.
Added an EarlyStopping callback to halt training when validation loss stops improving, with restore_best_weights=True to keep the best-performing weights.
Used validation_split=0.25 during training to monitor validation performance.
Kept the model architecture simple but added regularization.
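To make the early-stopping behavior concrete, here is a minimal pure-Python sketch of what EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True) does conceptually (this is an illustration of the logic, not the Keras implementation; min_delta and other options are ignored):

```python
def early_stopping_epochs(val_losses, patience=2):
    """Sketch of early stopping: return (epochs actually run, index of the
    best epoch, whose weights restore_best_weights=True would reload)."""
    best = float("inf")   # best validation loss seen so far
    best_epoch = 0        # epoch that achieved it
    wait = 0              # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:      # no improvement for `patience` epochs
                return epoch + 1, best_epoch
    return len(val_losses), best_epoch

# Validation loss improves for three epochs, then stalls for two:
# training stops after epoch 5, and epoch 2's weights would be restored.
print(early_stopping_epochs([0.60, 0.45, 0.40, 0.42, 0.43, 0.41]))  # (5, 2)
```

With patience=2, a single bad epoch does not stop training; the run ends only after two consecutive epochs without a new best validation loss.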
Results Interpretation

Before: Training accuracy: 95%, Validation accuracy: 70%, Training loss: 0.15, Validation loss: 0.60

After: Training accuracy: 88%, Validation accuracy: 86%, Training loss: 0.30, Validation loss: 0.35

Adding dropout and early stopping helps reduce overfitting by preventing the model from memorizing training data, leading to better validation accuracy and more balanced training.
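Dropout works by randomly zeroing a fraction of activations during training, which prevents the network from relying on any single feature. A minimal pure-Python sketch of "inverted" dropout (the scheme Keras uses, shown here for illustration only):

```python
import random

def dropout(activations, rate, training=True):
    """Inverted dropout: during training, zero each activation with
    probability `rate` and scale survivors by 1/(1-rate) so the expected
    value is unchanged. At inference time, dropout is a no-op."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
acts = [1.0, 1.0, 1.0, 1.0]
print(dropout(acts, 0.5))                  # roughly half zeroed, rest doubled
print(dropout(acts, 0.5, training=False))  # unchanged at inference
```

Because a different random subset of units is silenced on each batch, the network cannot memorize the training set through any fixed pathway, which is why validation accuracy improves.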
Bonus Experiment
Try using a simpler model by reducing the number of neurons in the dense layer to 8 and see if validation accuracy improves further.
💡 Hint
Reducing model complexity can help prevent overfitting by limiting the model's capacity to memorize training data.
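One way to see why shrinking the dense layer helps is to count trainable parameters. Assuming the 16-dimensional embedding output from the solution above, this sketch compares the dense-layer parameter counts for a hidden size of 16 versus 8:

```python
def dense_params(in_dim, out_dim):
    # A fully connected layer has in_dim * out_dim weights plus out_dim biases.
    return in_dim * out_dim + out_dim

# Hidden Dense(16) + output Dense(1):  (16*16 + 16) + (16*1 + 1) = 289
hidden16 = dense_params(16, 16) + dense_params(16, 1)
# Hidden Dense(8) + output Dense(1):   (16*8 + 8)  + (8*1 + 1)  = 145
hidden8 = dense_params(16, 8) + dense_params(8, 1)
print(hidden16, hidden8)  # 289 145
```

Halving the hidden width roughly halves the dense-layer parameters, directly limiting how much of the training data the model can memorize.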