
SVM for text classification in NLP - ML Experiment: Train & Evaluate


Experiment - SVM for text classification

Problem: Classify movie reviews as positive or negative using a Support Vector Machine (SVM) model on text data.

Current Metrics: Training accuracy: 98%, Validation accuracy: 75%

Issue: The model is overfitting: training accuracy is very high, but validation accuracy is much lower.

Your Task

Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 90%.

Use the SVM model with text data vectorized by TF-IDF.

Do not change the dataset or use a different model type.


Solution


from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset
reviews = load_files('aclImdb/train/', categories=['pos', 'neg'], shuffle=True, random_state=42)
X, y = reviews.data, reviews.target

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline with TF-IDF and LinearSVC
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, stop_words='english')),
    ('svm', LinearSVC())
])

# Hyperparameter tuning with GridSearchCV
param_grid = {'svm__C': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

# Best model
best_model = grid.best_estimator_

# Predictions
train_preds = best_model.predict(X_train)
val_preds = best_model.predict(X_val)

# Metrics
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')

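Capped the TF-IDF vocabulary with max_features=5000 and removed English stop words to limit input size and reduce noise.

Used GridSearchCV with 5-fold cross-validation to find the best regularization parameter C for the SVM.

Tuning C controls model complexity: a smaller C applies stronger regularization, which reduces overfitting.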
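To see how each candidate C behaves rather than only which one wins, the grid search can also record training-fold scores. The sketch below is a minimal illustration under assumptions: the eight toy reviews and the cv=2 setting are hypothetical stand-ins so the snippet runs on its own, and the numbers it prints say nothing about the movie-review results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews (1 = positive, 0 = negative), only for illustration.
texts = [
    "great film with wonderful acting", "loved this wonderful movie",
    "great story and great cast", "enjoyable film, loved the cast",
    "terrible film with awful acting", "hated this boring movie",
    "awful story and terrible cast", "boring film, hated the plot",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, stop_words='english')),
    ('svm', LinearSVC(random_state=42)),
])
param_grid = {'svm__C': [0.01, 0.1, 1, 10]}

# return_train_score=True also stores the accuracy on each training fold,
# so the train/validation gap is visible for every value of C.
grid = GridSearchCV(pipeline, param_grid, cv=2, return_train_score=True)
grid.fit(texts, labels)

results = grid.cv_results_
for c, train, val in zip(results['param_svm__C'],
                         results['mean_train_score'],
                         results['mean_test_score']):
    print(f'C={c}: mean train accuracy={train:.2f}, mean CV accuracy={val:.2f}')
print('Best C:', grid.best_params_['svm__C'])
```

A large gap between the two columns for a given C is the same overfitting signal described in the problem statement.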

Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 75%
After: Training accuracy: 89.5%, Validation accuracy: 86.2%

Limiting the vocabulary and tuning the regularization parameter C reduce model complexity. The gap between training and validation accuracy shrinks from 23 points to about 3: the model gives up some training accuracy but generalizes better to new data.

Bonus Experiment

Try using n-grams (e.g., bigrams) in the TF-IDF vectorizer to see if it improves validation accuracy further.

💡 Hint

Set the 'ngram_range' parameter in TfidfVectorizer to (1, 2) and rerun the grid search.

Practice

(1/5)

1. What is the main purpose of using an SVM (Support Vector Machine) in text classification?

easy

A. To find the best line that separates different text categories

B. To count the number of words in the text

C. To translate text into another language

D. To generate random text samples
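For readers who want to see what a linear SVM computes on text, here is a small sketch. It assumes four hypothetical toy reviews and uses LinearSVC as in the experiment. Each document gets a signed score w · x + b, and the sign of that score decides the predicted class.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy reviews (1 = positive, 0 = negative), only for illustration.
texts = ["great wonderful film", "loved it, great cast",
         "awful boring film", "hated it, terrible plot"]
labels = np.array([1, 1, 0, 0])

X = TfidfVectorizer().fit_transform(texts)
model = LinearSVC(C=1.0, random_state=42).fit(X, labels)

# The learned separator is a weight vector w and an intercept b.
w = model.coef_.ravel()
b = model.intercept_[0]

# Signed score for each document; this is what decision_function returns.
scores = X.toarray() @ w + b

# A positive score maps to classes_[1], otherwise classes_[0].
predicted = model.classes_[(scores > 0).astype(int)]
print(predicted)
```

In two dimensions the separator is a line; with thousands of TF-IDF features it is a hyperplane, but the computation is the same.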

