0
0
NLPml~20 mins

Naive Bayes for text in NLP - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - Naive Bayes for text
Problem:Classify movie reviews as positive or negative using Naive Bayes.
Current Metrics:Training accuracy: 98%, Validation accuracy: 70%
Issue:The model overfits: training accuracy is very high but validation accuracy is much lower.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.
Use Naive Bayes classifier only.
You can change text preprocessing and feature extraction steps.
Do not change the dataset.
Hint 1
Hint 2
Hint 3
Solution
NLP
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load dataset
reviews = load_files('txt_sentoken')  # Assume dataset folder with positive/negative subfolders
X, y = reviews.data, reviews.target

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Text preprocessing and feature extraction
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

# Train Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Predict and evaluate
train_preds = model.predict(X_train_tfidf)
val_preds = model.predict(X_val_tfidf)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Replaced raw count vectorizer with TF-IDF vectorizer to better represent word importance.
Removed English stop words to reduce noise.
Limited maximum features to 5000 to reduce overfitting.
Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 70%

After: Training accuracy: 90.5%, Validation accuracy: 86.3%

Using TF-IDF features and removing stop words reduces overfitting in Naive Bayes text classification, improving validation accuracy while lowering training accuracy.
Bonus Experiment
Try using n-grams (like bigrams) in the TF-IDF vectorizer to see if validation accuracy improves further.
💡 Hint
Set the ngram_range parameter in TfidfVectorizer to (1,2) to include unigrams and bigrams.