Text Classification with sklearn in Python: Simple Guide
To do text classification in Python with sklearn, use CountVectorizer or TfidfVectorizer to convert text into numbers, then train a classifier such as LogisticRegression. Fit the vectorizer and model on training data, then predict labels for new text.

Syntax
Text classification in sklearn typically involves these steps:
- Vectorizer: Convert text to numeric features using CountVectorizer or TfidfVectorizer.
- Classifier: Use a model like LogisticRegression, MultinomialNB, or SVC to learn from the features.
- Fit: Train the vectorizer and classifier on labeled text data.
- Predict: Use the trained model to classify new text.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer()
classifier = LogisticRegression()

# Fit the vectorizer and classifier on training data
X_train_counts = vectorizer.fit_transform(train_texts)
classifier.fit(X_train_counts, train_labels)

# Transform (not fit) test data, then predict
X_test_counts = vectorizer.transform(test_texts)
predictions = classifier.predict(X_test_counts)
```
Example
This example shows how to classify simple text messages as spam or not spam using sklearn.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
texts = [
    'Free money now!!!',
    'Hi, how are you?',
    'Win a free lottery ticket',
    'Hello friend, long time no see',
    'Claim your free prize',
    'Are we meeting today?'
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42
)

# Vectorize text
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)

# Train classifier
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train_counts, y_train)

# Predict
X_test_counts = vectorizer.transform(X_test)
y_pred = classifier.predict(X_test_counts)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('Predictions:', y_pred.tolist())
```
Output
Accuracy: 1.00
Predictions: [0, 1]
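The same workflow works with TfidfVectorizer, which downweights words that appear in many documents so distinctive words carry more signal. A minimal sketch of the swap, using hypothetical sample texts for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data for illustration
train_texts = [
    'Free money now!!!',
    'Hi, how are you?',
    'Claim your free prize',
    'Are we meeting today?'
]
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# TfidfVectorizer works as a drop-in replacement for CountVectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, train_labels)

# Transform (not fit_transform) new text with the already-fitted vectorizer
X_new = vectorizer.transform(['free prize money'])
print(classifier.predict(X_new))
```

On longer documents, tf-idf features often outperform raw counts because frequent filler words stop dominating the feature space.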
Common Pitfalls
Common mistakes when doing text classification with sklearn include:
- Not fitting the vectorizer on training data before transforming test data, causing errors or poor results.
- Using raw text directly without vectorization, which sklearn models cannot handle.
- Ignoring data splitting, leading to overfitting and misleading accuracy.
- Not preprocessing text (such as lowercasing), which can reduce model quality.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ['Hello world', 'Free money']
labels = [0, 1]

vectorizer = CountVectorizer()

# Wrong: transforming data before fitting the vectorizer
try:
    X_test_counts = vectorizer.transform(texts)
except Exception as e:
    print('Error:', e)

# Right: fit_transform on training data first
X_train_counts = vectorizer.fit_transform(texts)
classifier = LogisticRegression()
classifier.fit(X_train_counts, labels)
```
Output
Error: Vocabulary not fitted or provided
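One way to sidestep the fit/transform ordering pitfall is to chain the vectorizer and classifier in a Pipeline, so sklearn handles the ordering for you. A sketch using the same toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ['Hello world', 'Free money']
labels = [0, 1]

# The pipeline fits the vectorizer and classifier together in one call
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

# predict() accepts raw strings; the pipeline vectorizes internally
print(model.predict(['free money today']))
```

Because the pipeline only ever calls transform at prediction time, it is impossible to accidentally refit the vectorizer on test data.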
Quick Reference
Summary tips for sklearn text classification:
- Use CountVectorizer or TfidfVectorizer to convert text to numbers.
- Choose a classifier like LogisticRegression or MultinomialNB.
- Always fit the vectorizer on training data before transforming test data.
- Split data into training and testing sets to evaluate performance.
- Check accuracy or other metrics to measure model quality.
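MultinomialNB, mentioned above, pairs naturally with count features and is a common fast baseline for text. A minimal sketch with hypothetical sample data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled texts for illustration
texts = ['Free money now', 'Hi how are you', 'Win a free prize', 'Hello friend']
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# MultinomialNB models word counts directly
nb = MultinomialNB()
nb.fit(X, labels)

print(nb.predict(vectorizer.transform(['free prize money'])))
```

Naive Bayes trains almost instantly even on large corpora, which makes it a useful first model to compare LogisticRegression against.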
Key Takeaways
- Convert text to numeric features using sklearn vectorizers before classification.
- Fit the vectorizer and classifier only on training data to avoid errors.
- Split data into training and test sets to properly evaluate your model.
- Use simple classifiers like LogisticRegression for effective text classification.
- Check model accuracy to understand how well your classifier works.