ML Python · ~20 mins

Bag of Words and TF-IDF in ML Python - ML Experiment: Train & Evaluate

Experiment - Bag of Words and TF-IDF
Problem: You want to classify movie reviews as positive or negative using text data. Currently, you use a Bag of Words model with a simple logistic regression classifier.
Current Metrics: Training accuracy: 95%, Validation accuracy: 70%
Issue: The model is overfitting. Training accuracy is very high but validation accuracy is much lower, showing poor generalization.
Your Task
Reduce overfitting by improving the text representation to increase validation accuracy to at least 80% while keeping training accuracy below 90%.
You must keep using logistic regression as the classifier.
You can only change the text feature extraction method and its parameters.
Do not change the dataset or add more data.
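For reference, the overfitting baseline pairs raw word counts with logistic regression. A minimal sketch on a toy corpus (the sentences and labels below are made up for illustration, not taken from the IMDB data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative toy corpus; 1 = positive review, 0 = negative review
docs = [
    "a wonderful heartfelt film",
    "boring plot and terrible acting",
    "great performances and a great story",
    "dull script, awful pacing",
]
labels = [1, 0, 1, 0]

# Bag of Words: each document becomes a vector of raw word counts
bow = CountVectorizer()
X = bow.fit_transform(docs)

clf = LogisticRegression()
clf.fit(X, labels)

print(clf.predict(bow.transform(["a great film"])))
```

Because every word count feeds the classifier with equal footing, frequent but uninformative words can dominate and the model memorizes the training vocabulary, which is the overfitting you are asked to fix.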
Solution
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
reviews = load_files('aclImdb/train/', categories=['pos', 'neg'], shuffle=True, random_state=42)
X, y = reviews.data, reviews.target

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert bytes to strings
X_train = [doc.decode('utf-8') for doc in X_train]
X_val = [doc.decode('utf-8') for doc in X_val]

# Use TF-IDF vectorizer with stop words removal and max features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000, ngram_range=(1,2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

# Train logistic regression
model = LogisticRegression(max_iter=200, random_state=42)
model.fit(X_train_tfidf, y_train)

# Predict and evaluate
train_preds = model.predict(X_train_tfidf)
val_preds = model.predict(X_val_tfidf)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Replaced CountVectorizer (Bag of Words) with TfidfVectorizer to weigh words by importance.
Added stop words removal to ignore common words.
Limited features to top 5000 to reduce noise and overfitting.
Included bigrams (2-word sequences) to capture more context.
Results Interpretation

Before: Training accuracy: 95%, Validation accuracy: 70% (overfitting)

After: Training accuracy: 88.5%, Validation accuracy: 81.2% (better generalization)

Using TF-IDF with stop words removal and feature limits helps reduce overfitting by focusing on important words and ignoring noise, improving validation accuracy.
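The down-weighting effect can be seen directly on a toy corpus (the sentences below are illustrative only): words that appear in every document get a low inverse document frequency, so their TF-IDF weight falls below that of rarer, more discriminative words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was good",
    "the movie was bad",
    "the acting was superb",
]

vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()
col = vec.vocabulary_  # maps term -> column index

# "the" and "was" occur in every document, so their IDF (and hence
# their TF-IDF weight) is lower than that of rarer words like "good"
print("the  :", weights[0][col["the"]])
print("good :", weights[0][col["good"]])
```

Adding `stop_words='english'` removes such filler words entirely, and `max_features` keeps only the highest-frequency remaining terms, both of which shrink the feature space the classifier can memorize.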
Bonus Experiment
Try adding n-grams of length 3 (trigrams) and see if validation accuracy improves further without increasing overfitting.
💡 Hint
Increase ngram_range to (1,3) in TfidfVectorizer and observe changes in accuracy.
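Keep in mind that widening the n-gram range grows the vocabulary quickly, which can reintroduce overfitting. A small sketch (toy sentences, illustrative only) showing how the feature count grows as the range expands:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "not a good movie at all",
    "a very good movie indeed",
]

# Vocabulary size grows as longer n-grams are included
for ngrams in [(1, 1), (1, 2), (1, 3)]:
    vec = TfidfVectorizer(ngram_range=ngrams)
    vec.fit(docs)
    print(ngrams, "->", len(vec.vocabulary_), "features")
```

If validation accuracy drops with trigrams, the extra features are likely adding noise; `max_features` (or a higher `min_df`) can cap the growth.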