
How to Use SVM for Text Classification in NLP

To use a support vector machine (SVM) for text classification in NLP, first convert the text into numerical features with a method such as TF-IDF. Then train an SVM on those features so it can assign new text to one of the categories it has seen.

📐

Syntax

Using SVM for text classification involves these main steps:

Text Vectorization: Convert text into numbers using TfidfVectorizer.
Model Training: Use sklearn.svm.SVC or LinearSVC to train the classifier.
Prediction: Use the trained model to predict labels for new text.

python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Step 1: Convert text to features (texts_train, labels_train, and texts_test stand in for your own data)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)

# Step 2: Train SVM model
model = LinearSVC()
model.fit(X_train, labels_train)

# Step 3: Predict on new data (reuse the fitted vectorizer: transform, not fit_transform)
X_test = vectorizer.transform(texts_test)
predictions = model.predict(X_test)
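The same three steps can also be bundled into a single scikit-learn pipeline, which makes it harder to refit the vectorizer on test data by accident. This is only a sketch: the toy sentences and the "autos" / "med" labels are made up for illustration, and a real model needs far more training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only
texts_train = [
    "the engine and brakes need service",
    "new tires improve the car handling",
    "the doctor prescribed a new medicine",
    "patients reported fewer symptoms after treatment",
]
labels_train = ["autos", "autos", "med", "med"]

# fit() runs fit_transform on the vectorizer, then trains the SVM
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts_train, labels_train)

# predict() accepts raw strings; the pipeline calls transform() internally
texts_test = ["my car engine makes noise", "this medicine helped patients with symptoms"]
predictions = clf.predict(texts_test)
print(predictions)
```

Because the pipeline is one estimator, it can be passed directly to cross-validation or grid-search utilities.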

💻

Example

This example classifies posts from the 20 Newsgroups dataset into two topics, rec.autos and sci.med, using TF-IDF vectorization and LinearSVC.

python

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Sample data: two categories
categories = ['rec.autos', 'sci.med']
data_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))

# Vectorize text
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)

# Train SVM
model = LinearSVC()
model.fit(X_train, data_train.target)

# Predict and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(data_test.target, predictions)
print(f"Accuracy: {accuracy:.2f}")

Output

Accuracy: 0.92

⚠️

Common Pitfalls

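Not preprocessing text: Noisy raw text can reduce accuracy. TfidfVectorizer lowercases by default; pass stop_words='english' to drop common words as well.
Using default SVM without tuning: The regularization parameter C controls how strongly training errors are penalized, so try several values with cross-validation instead of relying on the default.
Ignoring feature scaling: TfidfVectorizer L2-normalizes each document vector by default, which suits SVMs. If you combine TF-IDF with other features, bring them to a comparable scale first.
Using a kernel SVM on very large datasets: SVC training time grows at least quadratically with the number of samples; consider LinearSVC or another linear classifier.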

python

from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer

# Less suitable: raw counts (no TF-IDF weighting) with the default RBF-kernel SVC
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(texts_train)
model = SVC()  # kernel SVM: training becomes slow as the dataset grows
model.fit(X_train, labels_train)

# Right: Use TF-IDF and LinearSVC for better speed and performance
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(texts_train)
model = LinearSVC(C=1.0)
model.fit(X_train, labels_train)
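For the tuning pitfall, one possible approach is a cross-validated grid search over C. The sketch below assumes a tiny invented corpus, a hypothetical grid of four C values, and macro F1 as the score; on real data you would use more folds and more examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus invented for illustration; real tuning needs far more data
texts = [
    "the engine stalled on the highway",
    "replace the brake pads and rotors",
    "this sedan has great fuel economy",
    "the mechanic changed the oil filter",
    "the doctor recommended more rest",
    "this vaccine reduces infection risk",
    "symptoms include fever and cough",
    "the nurse measured blood pressure",
]
labels = ["autos"] * 4 + ["med"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])

# "svm__C" addresses the C parameter of the pipeline step named "svm"
param_grid = {"svm__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print("Best C:", search.best_params_["svm__C"])
print("Best cross-validated macro F1:", round(search.best_score_, 2))
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the held-out fold never leaks into the vocabulary.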

📊

Quick Reference

Tips for using SVM in NLP:

Always convert text to numerical features (TF-IDF is a common, well-suited choice).
Use LinearSVC for faster training on text data.
Tune the regularization parameter C to balance bias and variance.
Preprocess text: lowercase, remove stop words, and clean punctuation.
Evaluate with accuracy when classes are roughly balanced, and with F1-score when they are imbalanced.
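The last tip is easiest to see with numbers. In this small sketch the labels are hypothetical: nine negatives, one positive, and a "classifier" that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 9 negatives (0) and 1 positive (1)
y_true = [0] * 9 + [1]
# A degenerate classifier that always predicts the majority class
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 defines the score as 0 when the positive class is never predicted
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")        # Accuracy: 0.90
print(f"F1 (positive class): {f1:.2f}")   # F1 (positive class): 0.00
```

Accuracy looks high even though the positive class is never found, which is why F1 is the more informative score on imbalanced data.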

✅

Key Takeaways

Convert text to numerical features using TF-IDF before applying SVM.

Use LinearSVC for efficient and effective text classification.

Preprocess text data to improve model accuracy.

Tune SVM parameters like C for better performance.

Evaluate model predictions with accuracy or F1-score.