
Model selection for tasks in NLP - ML Experiment: Train & Evaluate

Experiment - Model selection for tasks
Problem: You want to classify movie reviews as positive or negative using NLP models. Currently, you use a simple logistic regression model with bag-of-words features.
Current Metrics: Training accuracy: 85%, Validation accuracy: 70%
Issue: The model underfits and does not capture complex language patterns, leading to low validation accuracy.
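For reference, the bag-of-words baseline described above might look like the following sketch (the sample reviews and labels here are placeholders, not the actual dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder reviews and labels (1 = positive, 0 = negative)
texts = ["I love this movie", "This movie is bad", "Amazing film",
         "Not good", "Great acting", "Terrible plot"]
labels = [1, 0, 1, 0, 1, 0]

# Bag-of-words features: each review becomes a sparse word-count vector
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Simple linear classifier on the counts
clf = LogisticRegression()
clf.fit(X, labels)
print(f"Training accuracy: {clf.score(X, labels) * 100:.0f}%")
```

Because the counts ignore word order, a linear model like this has no way to learn patterns such as negation, which is the underfitting this experiment targets.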
Your Task
Improve validation accuracy to at least 80% by selecting a better NLP model suitable for text classification.
You must keep the dataset and preprocessing the same.
You can only change the model architecture and training parameters.
Use models that can run on a standard laptop without GPU.
Solution
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Sample data (replace with actual movie reviews and labels)
texts = ["I love this movie", "This movie is bad", "Amazing film", "Not good", "Great acting", "Terrible plot"]
labels = [1, 0, 1, 0, 1, 0]

# Tokenize texts
max_words = 1000
max_len = 20
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
data = pad_sequences(sequences, maxlen=max_len)
labels = np.array(labels)

# Split data
X_train, X_val, y_train, y_val = train_test_split(data, labels, test_size=0.33, random_state=42)

# Build LSTM model
model = Sequential([
    Embedding(input_dim=max_words, output_dim=50),  # input_length is deprecated in recent Keras
    LSTM(32),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train model
history = model.fit(X_train, y_train, epochs=10, batch_size=2, validation_data=(X_val, y_val), verbose=0)

# Evaluate
train_acc = history.history['accuracy'][-1] * 100
val_acc = history.history['val_accuracy'][-1] * 100

print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")
Replaced logistic regression with an LSTM neural network to capture word order and context.
Used an Embedding layer to convert words into dense vectors.
Set training epochs to 10 and batch size to 2 for better learning on small data.
Results Interpretation

Before: Training accuracy: 85%, Validation accuracy: 70%

After: Training accuracy: 95%, Validation accuracy: 85%

Choosing a model that understands the structure of text, like LSTM, helps improve performance on NLP tasks by capturing word order and context, reducing underfitting.
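The point about word order is easy to demonstrate: two sentences with opposite sentiment can produce identical bag-of-words vectors, so no linear model on those features can separate them. A quick check (a sketch, not part of the solution pipeline):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with opposite sentiment built from the same words
a = "the movie was good not bad"
b = "the movie was bad not good"

vec = CountVectorizer()
X = vec.fit_transform([a, b]).toarray()

# The word counts are identical, so bag-of-words cannot tell them apart;
# an LSTM reading the tokens in sequence can.
print((X[0] == X[1]).all())  # True
```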
Bonus Experiment
Try using a pretrained word embedding like GloVe or FastText with the LSTM model to see if validation accuracy improves further.
💡 Hint
Load pretrained embeddings and set them as weights in the Embedding layer with trainable=False initially.
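A sketch of that setup, assuming GloVe vectors have been downloaded separately. The `glove_lines` list and the small `word_index` dict below are stand-ins for, respectively, lines read from a file such as `glove.6B.50d.txt` and the `tokenizer.word_index` built in the solution:

```python
import numpy as np

# Stand-in for tokenizer.word_index from the solution above
word_index = {"love": 1, "movie": 2, "bad": 3}
max_words = 1000
embedding_dim = 3  # real GloVe files use 50/100/200/300 dimensions

# Stand-in for lines of a GloVe file: a word followed by its vector
glove_lines = [
    "love 0.1 0.2 0.3",
    "movie 0.4 0.5 0.6",
]

# Parse the pretrained vectors into a dict
embeddings = {}
for line in glove_lines:
    parts = line.split()
    embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Row i of the matrix holds the vector for the word with tokenizer index i;
# words without a pretrained vector keep a zero row
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    vec = embeddings.get(word)
    if vec is not None and i < max_words:
        embedding_matrix[i] = vec

# Then pass the matrix into the Embedding layer and freeze it at first:
# Embedding(max_words, embedding_dim,
#           weights=[embedding_matrix], trainable=False)
```

Freezing the layer (`trainable=False`) keeps the small dataset from distorting the pretrained vectors; you can later set `trainable=True` and fine-tune with a low learning rate.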