NLP · ~20 mins

Bag of Words (CountVectorizer) in NLP - ML Experiment: Train & Evaluate

Experiment - Bag of Words (CountVectorizer)
Problem: We want to classify movie reviews as positive or negative using a simple Bag of Words model with CountVectorizer and a logistic regression classifier.
Current Metrics: Training accuracy: 98%, Validation accuracy: 70%
Issue: The model is overfitting: it performs very well on the training data but poorly on the validation data.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85%, while keeping training accuracy below 92%.
You may change only the preprocessing and model hyperparameters.
Do not change the dataset or use a different model.
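Before tuning, it helps to see why an unconstrained CountVectorizer overfits so easily: with bigrams and no vocabulary limits, the feature count explodes relative to the data. A minimal sketch on an invented toy corpus (the experiment itself uses the IMDb reviews) contrasts an unconstrained vectorizer with a constrained one:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = [
    "the movie was absolutely wonderful and moving",
    "the movie was dull and the plot was terrible",
    "a wonderful cast but a terrible script",
    "dull pacing yet a moving final act",
]

# Unconstrained: every unigram and bigram becomes a feature
loose = CountVectorizer(ngram_range=(1, 2))
loose.fit(docs)

# Constrained: stop words dropped, vocabulary capped at 10 terms
tight = CountVectorizer(ngram_range=(1, 2), stop_words='english', max_features=10)
tight.fit(docs)

print(len(loose.vocabulary_), len(tight.vocabulary_))
```

Even on four short documents the unconstrained vocabulary is several times larger; on thousands of full-length reviews the gap is what lets the classifier memorize the training set.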
Solution
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Load dataset
reviews = load_files('aclImdb/train/', categories=['pos', 'neg'], shuffle=True, random_state=42)
X, y = reviews.data, reviews.target

# Split data (stratify keeps the pos/neg class ratio equal in both splits)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Create pipeline with CountVectorizer and LogisticRegression
pipeline = make_pipeline(
    CountVectorizer(max_features=5000, ngram_range=(1,2), stop_words='english'),
    LogisticRegression(max_iter=200, C=1.0, penalty='l2', solver='liblinear', random_state=42)
)

# Train model
pipeline.fit(X_train, y_train)

# Predict and evaluate
train_preds = pipeline.predict(X_train)
val_preds = pipeline.predict(X_val)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Hint 1: Limit the vocabulary to the 5000 most frequent terms (max_features=5000) to reduce noise.
Hint 2: Use unigrams and bigrams (ngram_range=(1,2)) to capture some word context.
Hint 3: Remove English stop words (stop_words='english') to drop common, uninformative words.
Hint 4: Use L2 regularization in the logistic regression (penalty='l2' with a moderate C) to penalize large weights.
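The regularization knob that matters most here is C: in scikit-learn's LogisticRegression, C is the inverse of the penalty strength, so smaller C means more shrinkage. A small self-contained sketch (synthetic features stand in for the review vectors) makes the effect visible in the learned coefficients:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the Bag of Words features (illustration only)
X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

# Smaller C = stronger L2 penalty = smaller coefficients = less overfitting
strong = LogisticRegression(C=0.01, penalty='l2', solver='liblinear').fit(X, y)
weak = LogisticRegression(C=100.0, penalty='l2', solver='liblinear').fit(X, y)

print(np.abs(strong.coef_).mean(), np.abs(weak.coef_).mean())
```

The strongly regularized model's average coefficient magnitude is much smaller, which is exactly what keeps training accuracy below the 92% cap in the experiment.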
Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 70% (high overfitting)

After: Training accuracy: 90.5%, Validation accuracy: 86.3% (reduced overfitting, better generalization)

Reducing the vocabulary size, removing stop words, using n-grams, and adding regularization all help reduce overfitting in text classification with Bag of Words.
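Rather than hand-picking these values, you can search over the allowed knobs (vectorizer settings and C) with GridSearchCV. A sketch below fits on a tiny invented corpus so it runs standalone; on the real task you would fit on X_train/y_train and then check that the best model meets both accuracy constraints:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('clf', LogisticRegression(max_iter=200, solver='liblinear')),
])

# The knobs the task allows us to tune
param_grid = {
    'vec__max_features': [2000, 5000],
    'vec__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0],
}

# Tiny toy corpus (illustration only; use the IMDb splits in the experiment)
docs = [
    "great film loved it", "amazing wonderful acting",
    "loved this great movie", "wonderful amazing story",
    "awful boring movie", "terrible bad plot",
    "boring awful acting", "bad terrible film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

search = GridSearchCV(pipe, param_grid, cv=2, scoring='accuracy')
search.fit(docs, labels)
print(search.best_params_)
```

Cross-validated search picks settings by held-out accuracy, which directly targets the generalization gap the experiment is about.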
Bonus Experiment
Try using TF-IDF vectorizer instead of CountVectorizer and compare the validation accuracy.
💡 Hint
TF-IDF weighs words by importance, which can improve model focus on meaningful words and reduce noise.
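Because TfidfVectorizer shares CountVectorizer's interface, the bonus experiment is a one-line swap in the pipeline. A sketch below fits on a tiny invented corpus just to show the pipeline runs; for the actual comparison, fit on X_train/y_train and recompute val_acc exactly as in the solution:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Drop-in replacement: same pipeline shape, TfidfVectorizer instead of CountVectorizer
tfidf_pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english'),
    LogisticRegression(max_iter=200, C=1.0, penalty='l2', solver='liblinear', random_state=42),
)

# Toy corpus (illustration only; use the IMDb splits in the experiment)
docs = [
    "great film loved it", "awful boring movie",
    "loved the great acting", "boring and awful plot",
]
labels = [1, 0, 1, 0]
tfidf_pipeline.fit(docs, labels)
print(tfidf_pipeline.predict(["great acting loved it"]))
```

TF-IDF downweights terms that appear in most documents, which often acts as an extra implicit regularizer on top of the stop-word list.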