ML Python · ~20 mins

Bag of Words and TF-IDF in ML Python - ML Experiment: Train & Evaluate

Experiment - Bag of Words and TF-IDF
Problem: You want to classify movie reviews as positive or negative using text data. Currently, you use a Bag of Words model with a simple logistic regression classifier.
Current Metrics: Training accuracy: 95%, Validation accuracy: 70%
Issue: The model is overfitting. Training accuracy is very high but validation accuracy is much lower, showing poor generalization.
Your Task
Reduce overfitting by improving the text representation to increase validation accuracy to at least 80% while keeping training accuracy below 90%.
You must keep using logistic regression as the classifier.
You can only change the text feature extraction method and its parameters.
Do not change the dataset or add more data.
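For reference, the overfitting baseline pairs raw word counts with logistic regression. A minimal sketch on a toy corpus (the sentences and labels below are made up for illustration, not taken from the IMDB data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative toy corpus; 1 = positive review, 0 = negative review
docs = [
    "a wonderful heartfelt film",
    "boring plot and terrible acting",
    "great performances and a great story",
    "dull script, awful pacing",
]
labels = [1, 0, 1, 0]

# Bag of Words: each document becomes a vector of raw word counts
bow = CountVectorizer()
X = bow.fit_transform(docs)

clf = LogisticRegression()
clf.fit(X, labels)

print(clf.predict(bow.transform(["a great film"])))
```

Because every word count feeds the classifier with equal footing, frequent but uninformative words can dominate and the model memorizes the training vocabulary, which is the overfitting you are asked to fix.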
Solution
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
reviews = load_files('aclImdb/train/', categories=['pos', 'neg'], shuffle=True, random_state=42)
X, y = reviews.data, reviews.target

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert bytes to strings
X_train = [doc.decode('utf-8') for doc in X_train]
X_val = [doc.decode('utf-8') for doc in X_val]

# Use TF-IDF vectorizer with stop words removal and max features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000, ngram_range=(1,2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

# Train logistic regression
model = LogisticRegression(max_iter=200, random_state=42)
model.fit(X_train_tfidf, y_train)

# Predict and evaluate
train_preds = model.predict(X_train_tfidf)
val_preds = model.predict(X_val_tfidf)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Replaced CountVectorizer (Bag of Words) with TfidfVectorizer to weigh words by importance.
Added stop words removal to ignore common words.
Limited features to top 5000 to reduce noise and overfitting.
Included bigrams (2-word sequences) to capture more context.
Results Interpretation

Before: Training accuracy: 95%, Validation accuracy: 70% (overfitting)

After: Training accuracy: 88.5%, Validation accuracy: 81.2% (better generalization)

Using TF-IDF with stop words removal and feature limits helps reduce overfitting by focusing on important words and ignoring noise, improving validation accuracy.
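The down-weighting effect can be seen directly on a toy corpus (the sentences below are illustrative only): words that appear in every document get a low inverse document frequency, so their TF-IDF weight falls below that of rarer, more discriminative words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was good",
    "the movie was bad",
    "the acting was superb",
]

vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()
col = vec.vocabulary_  # maps term -> column index

# "the" and "was" occur in every document, so their IDF (and hence
# their TF-IDF weight) is lower than that of rarer words like "good"
print("the  :", weights[0][col["the"]])
print("good :", weights[0][col["good"]])
```

Adding `stop_words='english'` removes such filler words entirely, and `max_features` keeps only the highest-frequency remaining terms, both of which shrink the feature space the classifier can memorize.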
Bonus Experiment
Try adding n-grams of length 3 (trigrams) and see if validation accuracy improves further without increasing overfitting.
💡 Hint
Increase ngram_range to (1,3) in TfidfVectorizer and observe changes in accuracy.
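Keep in mind that widening the n-gram range grows the vocabulary quickly, which can reintroduce overfitting. A small sketch (toy sentences, illustrative only) showing how the feature count grows as the range expands:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "not a good movie at all",
    "a very good movie indeed",
]

# Vocabulary size grows as longer n-grams are included
for ngrams in [(1, 1), (1, 2), (1, 3)]:
    vec = TfidfVectorizer(ngram_range=ngrams)
    vec.fit(docs)
    print(ngrams, "->", len(vec.vocabulary_), "features")
```

If validation accuracy drops with trigrams, the extra features are likely adding noise; `max_features` (or a higher `min_df`) can cap the growth.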