Text Classification with sklearn in Python: Simple Guide
To do text classification in Python with sklearn, use CountVectorizer or TfidfVectorizer to convert text into numbers, then train a classifier such as LogisticRegression. Fit the vectorizer and model on training data, then predict labels for new text.

Syntax
Text classification in sklearn typically involves these steps:
- Vectorizer: Convert text to numeric features using CountVectorizer or TfidfVectorizer.
- Classifier: Use a model like LogisticRegression, MultinomialNB, or SVC to learn from the features.
- Fit: Train the vectorizer and classifier on labeled text data.
- Predict: Use the trained model to classify new text.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer()
classifier = LogisticRegression()

# Fit the vectorizer and classifier on training data
X_train_counts = vectorizer.fit_transform(train_texts)
classifier.fit(X_train_counts, train_labels)

# Transform (not fit) test data, then predict
X_test_counts = vectorizer.transform(test_texts)
predictions = classifier.predict(X_test_counts)
```
Example
This example shows how to classify simple text messages as spam or not spam using sklearn.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
texts = [
    'Free money now!!!',
    'Hi, how are you?',
    'Win a free lottery ticket',
    'Hello friend, long time no see',
    'Claim your free prize',
    'Are we meeting today?'
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42
)

# Vectorize text
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)

# Train classifier
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train_counts, y_train)

# Predict
X_test_counts = vectorizer.transform(X_test)
y_pred = classifier.predict(X_test_counts)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('Predictions:', y_pred.tolist())
```
Output
Accuracy: 1.00
Predictions: [0, 1]
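The same workflow works with TfidfVectorizer, which downweights words that appear in many documents so distinctive words carry more signal. A minimal sketch of the swap, using hypothetical sample texts for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data for illustration
train_texts = [
    'Free money now!!!',
    'Hi, how are you?',
    'Claim your free prize',
    'Are we meeting today?'
]
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# TfidfVectorizer works as a drop-in replacement for CountVectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, train_labels)

# Transform (not fit_transform) new text with the already-fitted vectorizer
X_new = vectorizer.transform(['free prize money'])
print(classifier.predict(X_new))
```

On longer documents, tf-idf features often outperform raw counts because frequent filler words stop dominating the feature space.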
Common Pitfalls
Common mistakes when doing text classification with sklearn include:
- Not fitting the vectorizer on training data before transforming test data, causing errors or poor results.
- Using raw text directly without vectorization, which sklearn models cannot handle.
- Ignoring data splitting, leading to overfitting and misleading accuracy.
- Not preprocessing text (such as lowercasing), which can reduce model quality.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ['Hello world', 'Free money']
labels = [0, 1]

vectorizer = CountVectorizer()

# Wrong: transforming data before fitting the vectorizer
try:
    X_test_counts = vectorizer.transform(texts)
except Exception as e:
    print('Error:', e)

# Right: fit_transform on training data first
X_train_counts = vectorizer.fit_transform(texts)
classifier = LogisticRegression()
classifier.fit(X_train_counts, labels)
```
Output
Error: Vocabulary not fitted or provided
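One way to sidestep the fit/transform ordering pitfall is to chain the vectorizer and classifier in a Pipeline, so sklearn handles the ordering for you. A sketch using the same toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ['Hello world', 'Free money']
labels = [0, 1]

# The pipeline fits the vectorizer and classifier together in one call
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

# predict() accepts raw strings; the pipeline vectorizes internally
print(model.predict(['free money today']))
```

Because the pipeline only ever calls transform at prediction time, it is impossible to accidentally refit the vectorizer on test data.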
Quick Reference
Summary tips for sklearn text classification:
- Use CountVectorizer or TfidfVectorizer to convert text to numbers.
- Choose a classifier like LogisticRegression or MultinomialNB.
- Always fit the vectorizer on training data before transforming test data.
- Split data into training and testing sets to evaluate performance.
- Check accuracy or other metrics to measure model quality.
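MultinomialNB, mentioned above, pairs naturally with count features and is a common fast baseline for text. A minimal sketch with hypothetical sample data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled texts for illustration
texts = ['Free money now', 'Hi how are you', 'Win a free prize', 'Hello friend']
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# MultinomialNB models word counts directly
nb = MultinomialNB()
nb.fit(X, labels)

print(nb.predict(vectorizer.transform(['free prize money'])))
```

Naive Bayes trains almost instantly even on large corpora, which makes it a useful first model to compare LogisticRegression against.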
Key Takeaways
- Convert text to numeric features using sklearn vectorizers before classification.
- Fit the vectorizer and classifier only on training data to avoid errors.
- Split data into training and test sets to properly evaluate your model.
- Use simple classifiers like LogisticRegression for effective text classification.
- Check model accuracy to understand how well your classifier works.