ML Python · ~20 mins

Sentiment analysis with scikit-learn in ML Python - ML Experiment: Train & Evaluate

Experiment - Sentiment analysis with scikit-learn
Problem: Classify movie reviews as positive or negative using a simple text model.
Current Metrics: Training accuracy: 98%, Validation accuracy: 75%
Issue: The model is overfitting: training accuracy is very high, but validation accuracy is much lower. (A baseline sketch that reproduces this gap follows the task below.)
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.
Use scikit-learn only.
Keep the same dataset and model type (text vectorization + logistic regression).
Do not add external data.
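
The starting code is not shown on this page, so here is a hypothetical baseline, assuming the same aclImdb/train/ folder layout used in the solution, that typically produces this kind of train/validation gap: an unpruned vocabulary plus weak regularization (a large C) lets the model memorize the training reviews.

from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical overfitting baseline (assumed, not the page's original code)
reviews = load_files('aclImdb/train/', categories=['pos', 'neg'], shuffle=True,
                     random_state=42, encoding='utf-8', decode_error='replace')
X_train, X_val, y_train, y_val = train_test_split(
    reviews.data, reviews.target, test_size=0.2, random_state=42)

vec = CountVectorizer()  # full vocabulary, no pruning
clf = LogisticRegression(C=100, max_iter=1000, solver='liblinear')  # weak regularization
clf.fit(vec.fit_transform(X_train), y_train)

print(f'Training accuracy:   {clf.score(vec.transform(X_train), y_train) * 100:.2f}%')
print(f'Validation accuracy: {clf.score(vec.transform(X_val), y_val) * 100:.2f}%')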
Solution
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset (decode files to UTF-8 strings so the vectorizer receives text, not bytes)
reviews = load_files('aclImdb/train/', categories=['pos', 'neg'], shuffle=True,
                     random_state=42, encoding='utf-8', decode_error='replace')
X, y = reviews.data, reviews.target

# Split data, stratifying to preserve the class balance in both splits
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Pipeline with vectorizer and logistic regression
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000, ngram_range=(1,2))),
    ('clf', LogisticRegression(max_iter=200, solver='liblinear'))
])

# Hyperparameter tuning for regularization strength
param_grid = {
    'clf__C': [0.01, 0.1, 1, 10]
}

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Best model evaluation
y_train_pred = grid.predict(X_train)
y_val_pred = grid.predict(X_val)

train_acc = accuracy_score(y_train, y_train_pred) * 100
val_acc = accuracy_score(y_val, y_val_pred) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Key Changes
Added English stop-word removal in TfidfVectorizer to reduce noise.
Limited max_features to 5000 to shrink the vocabulary and reduce model complexity.
Used unigrams and bigrams (ngram_range=(1, 2)) to capture short phrases such as negations.
Added L2 regularization to the logistic regression by tuning the inverse regularization strength C.
Used GridSearchCV with 5-fold cross-validation to pick the C that generalizes best (see the inspection snippet below).
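
To check what the search actually chose, you can inspect the fitted grid object; this short snippet (assuming grid has already been fit as in the solution above) prints the best parameters and the mean cross-validated accuracy for each candidate C:

# Assumes `grid` has been fit as in the solution above
print('Best params:', grid.best_params_)
print(f'Best mean CV accuracy: {grid.best_score_:.3f}')

# Mean CV accuracy for every candidate value of C
for C, score in zip(grid.cv_results_['param_clf__C'],
                    grid.cv_results_['mean_test_score']):
    print(f'C={C}: mean CV accuracy = {score:.3f}')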
Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 75% (overfitting)

After: Training accuracy: 90.5%, Validation accuracy: 86.3% (better generalization)

Adding regularization and limiting model complexity reduce how much the model can memorize the training set, so validation accuracy improves while training accuracy drops slightly. The toy sketch below illustrates the mechanism.
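
To make the mechanism concrete, here is a minimal, self-contained sketch on a hypothetical toy corpus (not the IMDb data): a smaller C means a stronger L2 penalty, which shrinks the learned weights and leaves less room for memorization.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus, only to illustrate the effect of C
texts = ["great movie", "awful movie", "loved it", "hated it",
         "great acting", "awful plot", "loved the cast", "hated the pacing"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)
for C in [0.01, 0.1, 1, 10]:
    clf = LogisticRegression(C=C, solver='liblinear').fit(X, labels)
    # Smaller C -> stronger penalty -> smaller weight norm
    print(f'C={C:<5} ||w|| = {np.linalg.norm(clf.coef_):.3f}')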
Bonus Experiment
Try using a different classifier like a Support Vector Machine (SVM) with similar preprocessing and compare results.
💡 Hint
Use sklearn.svm.SVC with a linear kernel and tune its C parameter in the same way; a sketch follows.
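
One possible sketch, reusing the imports and the X_train/X_val split from the solution above. SVC(kernel='linear') matches the hint but can be slow on the full IMDb training set; sklearn.svm.LinearSVC is a faster drop-in when the kernel is linear.

from sklearn.svm import SVC

# Same preprocessing as the solution, different classifier
svm_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000, ngram_range=(1, 2))),
    ('clf', SVC(kernel='linear'))
])

svm_grid = GridSearchCV(svm_pipeline, {'clf__C': [0.01, 0.1, 1, 10]}, cv=5, scoring='accuracy')
svm_grid.fit(X_train, y_train)

print(f'SVM training accuracy: {svm_grid.score(X_train, y_train) * 100:.2f}%')
print(f'SVM validation accuracy: {svm_grid.score(X_val, y_val) * 100:.2f}%')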