0
0
NLPml~20 mins

SVM for text classification in NLP - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - SVM for text classification
Problem:Classify movie reviews as positive or negative using a Support Vector Machine (SVM) model on text data.
Current Metrics:Training accuracy: 98%, Validation accuracy: 75%
Issue:The model is overfitting: training accuracy is very high but validation accuracy is much lower.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 90%.
Use the SVM model with text data vectorized by TF-IDF.
Do not change the dataset or use a different model type.
Hint 1
Hint 2
Hint 3
Solution
NLP
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset
reviews = load_files('aclImdb/train/', categories=['pos', 'neg'], shuffle=True, random_state=42)
X, y = reviews.data, reviews.target

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline with TF-IDF and LinearSVC
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, stop_words='english')),
    ('svm', LinearSVC())
])

# Hyperparameter tuning with GridSearchCV
param_grid = {'svm__C': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

# Best model
best_model = grid.best_estimator_

# Predictions
train_preds = best_model.predict(X_train)
val_preds = best_model.predict(X_val)

# Metrics
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Added TF-IDF vectorizer with max_features=5000 to limit input size and reduce noise.
Used GridSearchCV to find the best regularization parameter C for the SVM.
Reduced model complexity by tuning C to prevent overfitting.
Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 75%
After: Training accuracy: 89.5%, Validation accuracy: 86.2%

Reducing model complexity and tuning hyperparameters like the regularization parameter C helps reduce overfitting. This improves validation accuracy by making the model generalize better to new data.
Bonus Experiment
Try using n-grams (e.g., bigrams) in the TF-IDF vectorizer to see if it improves validation accuracy further.
💡 Hint
Set the 'ngram_range' parameter in TfidfVectorizer to (1, 2) and rerun the grid search.