
Bag of Words (CountVectorizer) in NLP - ML Experiment: Train & Evaluate


Experiment - Bag of Words (CountVectorizer)

Problem: We want to classify movie reviews as positive or negative using a simple Bag of Words model with CountVectorizer and a logistic regression classifier.

Current Metrics: Training accuracy: 98%, Validation accuracy: 70%

Issue: The model is overfitting: it performs very well on training data but poorly on validation data.

Your Task

Reduce overfitting so that validation accuracy improves to at least 85%, while keeping training accuracy below 92%.

You can only change the preprocessing and model hyperparameters.

Do not change the dataset or use a different model.


Solution


from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Load dataset
reviews = load_files('aclImdb/train/', categories=['pos', 'neg'], shuffle=True, random_state=42)
X, y = reviews.data, reviews.target

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline with CountVectorizer and LogisticRegression
pipeline = make_pipeline(
    CountVectorizer(max_features=5000, ngram_range=(1,2), stop_words='english'),
    LogisticRegression(max_iter=200, C=1.0, penalty='l2', solver='liblinear', random_state=42)
)

# Train model
pipeline.fit(X_train, y_train)

# Predict and evaluate
train_preds = pipeline.predict(X_train)
val_preds = pipeline.predict(X_val)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')

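Limited the vocabulary to the 5000 most frequent terms (max_features=5000) to reduce noise from rare terms.

Used unigrams and bigrams (ngram_range=(1,2)) to capture some word context.

Removed English stop words (stop_words='english') to ignore common, uninformative words.

Used L2 regularization in logistic regression (penalty='l2', C=1.0) to reduce overfitting; a smaller C applies a stronger penalty.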
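To see how the CountVectorizer options above change the feature set, here is a minimal sketch on a made-up three-sentence corpus (the corpus and the helper name vocab_size are assumptions for illustration, not part of the experiment). It only compares vocabulary sizes; it does not reproduce the accuracy figures.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration (not the IMDB data)
docs = [
    "the movie was great and the acting was great",
    "the movie was boring and the plot was weak",
    "great plot and great acting",
]


def vocab_size(**kwargs):
    """Fit a CountVectorizer with the given options and return its vocabulary size."""
    vec = CountVectorizer(**kwargs)
    vec.fit(docs)
    return len(vec.vocabulary_)


baseline = vocab_size()                                   # unigrams only
with_bigrams = vocab_size(ngram_range=(1, 2))             # unigrams plus bigrams
no_stop = vocab_size(stop_words="english")                # English stop words dropped
capped = vocab_size(ngram_range=(1, 2), max_features=5)   # keep the 5 most frequent terms

print("unigrams:", baseline)
print("unigrams + bigrams:", with_bigrams)
print("without stop words:", no_stop)
print("capped with max_features:", capped)
```

Note that bigrams enlarge the vocabulary on their own, so in the pipeline it is the max_features cap that keeps the feature count bounded.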

Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 70% (high overfitting)

After: Training accuracy: 90.5%, Validation accuracy: 86.3% (reduced overfitting, better generalization)

Reducing vocabulary size, removing stop words, and applying regularization help reduce overfitting in text classification with Bag of Words, while bigrams add some word context.
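The effect of regularization strength can be illustrated with a small sketch. It assumes a made-up eight-review corpus and a hypothetical helper coef_norm, and it shows only that a smaller C shrinks the learned weights, which is the mechanism that limits overfitting; it makes no claim about accuracy on the real dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy reviews, invented for illustration (1 = positive, 0 = negative)
texts = [
    "great film loved it", "wonderful acting great story",
    "loved the story", "brilliant and moving film",
    "terrible film hated it", "boring story weak acting",
    "hated the plot", "dull and slow film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = CountVectorizer().fit_transform(texts)


def coef_norm(C):
    """Train logistic regression with the given C and return the L2 norm of its weights."""
    # The L2 penalty is scikit-learn's default, so only C is varied here
    clf = LogisticRegression(C=C, solver="liblinear", random_state=42)
    clf.fit(X, labels)
    return float(np.linalg.norm(clf.coef_))


norms = {C: coef_norm(C) for C in (0.01, 1.0, 100.0)}
for C, norm in norms.items():
    print(f"C={C:<6} weight norm = {norm:.3f}")
```

If validation accuracy still lags after the preprocessing changes, trying a few smaller values of C in the pipeline is a reasonable next step.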

Bonus Experiment

Try using TF-IDF vectorizer instead of CountVectorizer and compare the validation accuracy.

💡 Hint

TF-IDF down-weights words that appear in most documents and gives more weight to words that are distinctive to a few, which can help the model focus on meaningful words and reduce noise.

Practice

(1/5)

1. What does the Bag of Words model do in text processing?

easy

A. Counts how often each word appears in the text

B. Translates text into another language

C. Removes all punctuation from the text

D. Generates summaries of the text


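Solution

Step 1: Understand Bag of Words purpose. Bag of Words represents a text by how many times each vocabulary word occurs in it, ignoring word order.

Step 2: Compare options to definition. Only option A describes counting word occurrences; translation, punctuation removal, and summarization are different tasks.

Final Answer: A. Counts how often each word appears in the text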
