How to Do Text Classification in Python for NLP
To do text classification in Python for NLP, you first convert text into numbers using techniques like TF-IDF or CountVectorizer, then train a machine learning model such as LogisticRegression or MultinomialNB on labeled data. Finally, you use the trained model to predict categories for new text inputs.
Syntax
Text classification in Python typically involves these steps:
- Text vectorization: Convert text to numeric features using CountVectorizer or TfidfVectorizer.
- Model training: Use a classifier like LogisticRegression or MultinomialNB from sklearn.
- Prediction: Apply the trained model to new text data to get predicted labels.
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# train_texts, train_labels, and test_texts are placeholders for your own data

# Step 1: Convert text to features
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Step 2: Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)

# Step 3: Predict (reuse the fitted vectorizer; do not refit it on test data)
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)
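The same three steps also work with the other pair named above, CountVectorizer and MultinomialNB. The following is a minimal sketch under that assumption; the toy texts and labels are made up for illustration and are not from a real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, invented for this sketch
train_texts = [
    'great acting',
    'great story',
    'boring plot',
    'boring acting was awful',
]
train_labels = [1, 1, 0, 0]  # 1=positive, 0=negative
test_texts = ['great story', 'boring awful plot']

# Step 1: Convert text to word-count features
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Step 2: Train a Naive Bayes classifier on the counts
model = MultinomialNB()
model.fit(X_train, train_labels)

# Step 3: Predict with the already fitted vectorizer
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)
print('Predictions:', predictions.tolist())
```

MultinomialNB expects non-negative features such as word counts, which is why it is paired with CountVectorizer here. It also accepts TF-IDF values, so swapping in TfidfVectorizer is a reasonable variation to try.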
Example
This example shows how to classify movie reviews as positive or negative using sklearn. It uses TfidfVectorizer to convert text and LogisticRegression to train the model.
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
texts = [
    'I love this movie',
    'This film was terrible',
    'Amazing story and great acting',
    'I did not like the movie',
    'Best movie ever',
    'Worst film I have seen'
]
labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# Split data
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.33, random_state=42)

# Vectorize text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)

# Predict
predictions = model.predict(X_test)

# Evaluate
accuracy = accuracy_score(test_labels, predictions)
print(f'Accuracy: {accuracy:.2f}')
print('Predictions:', predictions.tolist())
Output
Accuracy: 1.00
Predictions: [0, 1]
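Once a model is trained, classifying new text only requires the fitted vectorizer and the model. The sketch below assumes the same six sample reviews as training data; the two new review strings are hypothetical, and predict_proba is used to show how confident the model is.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# The six sample reviews from the example (1=positive, 0=negative)
texts = [
    'I love this movie',
    'This film was terrible',
    'Amazing story and great acting',
    'I did not like the movie',
    'Best movie ever',
    'Worst film I have seen',
]
labels = [1, 0, 1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# Hypothetical new inputs, not part of the training data
new_texts = ['Great acting and an amazing story', 'Terrible, the worst film']

# Use transform (not fit_transform) so new text maps onto the training vocabulary
X_new = vectorizer.transform(new_texts)
predicted = model.predict(X_new)
probabilities = model.predict_proba(X_new)  # columns follow model.classes_

for text, label, proba in zip(new_texts, predicted, probabilities):
    print(f'{text!r} -> label {label}, P(positive) = {proba[1]:.2f}')
```

Words that never appeared in the training data are silently ignored by transform, so a new text made up entirely of unseen words produces an all-zero vector and an uninformative prediction.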
Common Pitfalls
Common mistakes when doing text classification include:
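- Skipping text cleanup, which can reduce accuracy. CountVectorizer and TfidfVectorizer already lowercase text and ignore punctuation by default, so the remaining work is usually removing noise the vectorizer does not handle.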
- Using the same data for training and testing, causing overly optimistic results.
- Ignoring class imbalance, which can bias the model toward the majority class.
- Not tuning model parameters or vectorizer settings for better performance.
python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ['Good movie', 'Bad movie', 'Excellent film', 'Terrible film']
labels = [1, 0, 1, 0]

# Wrong: No train/test split
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)
predictions = model.predict(X)
print('Accuracy without split:', accuracy_score(labels, predictions))

# Right: With train/test split
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.5, random_state=1)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)
predictions = model.predict(X_test)
print('Accuracy with split:', accuracy_score(test_labels, predictions))
Output
Accuracy without split: 1.0
Accuracy with split: 1.0
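The class imbalance pitfall listed above can be handled in a similar way. The following is a minimal sketch, assuming a made-up dataset with eight positive and two negative reviews: the stratify argument of train_test_split keeps the class ratio in both splits, and class_weight='balanced' makes LogisticRegression weight each class inversely to its frequency.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 8 positive reviews, 2 negative
texts = [
    'Good movie', 'Excellent film', 'Great story', 'Loved the acting',
    'Wonderful cast', 'Brilliant plot', 'Enjoyable film', 'Fantastic movie',
    'Bad movie', 'Terrible film',
]
labels = [1] * 8 + [0] * 2

# stratify=labels keeps the 4:1 class ratio in both halves
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0)
print('Train counts:', Counter(train_labels))
print('Test counts:', Counter(test_labels))

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# class_weight='balanced' gives the rare class a larger weight during training
model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train, train_labels)
predictions = model.predict(X_test)
print('Predictions:', predictions.tolist())
```

Without stratification, a random split of a small imbalanced dataset can leave the training set with only one class, in which case LogisticRegression cannot be fitted at all.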
Quick Reference
Tips for effective text classification in Python:
- Always split data into training and testing sets.
- Use TfidfVectorizer for better feature representation than simple counts.
- Try simple models like LogisticRegression or MultinomialNB first.
- Preprocess text by lowercasing and removing noise.
- Evaluate model with accuracy or other metrics like F1-score.
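To make the last tip concrete, here is a small sketch of computing accuracy and F1-score side by side. The label and prediction lists are hypothetical values chosen for illustration, not the output of any model in this article.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Hypothetical true labels and predictions (1=positive, 0=negative)
test_labels = [1, 1, 1, 1, 0, 0]
predictions = [1, 1, 1, 0, 0, 1]

accuracy = accuracy_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions)  # F1 for the positive class (label 1)

print(f'Accuracy: {accuracy:.2f}')
print(f'F1-score: {f1:.2f}')

# Per-class precision, recall, and F1; target_names follow sorted label order
print(classification_report(
    test_labels, predictions, target_names=['negative', 'positive']))
```

Here four of six predictions are correct, so accuracy is about 0.67, while the positive class has precision and recall of 3/4 each, giving an F1-score of 0.75. On imbalanced data the two metrics can diverge much more, which is why reporting both is useful.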
Key Takeaways
Convert text to numeric features using vectorizers like TfidfVectorizer before training.
Train a simple classifier such as LogisticRegression on labeled text data.
Always split data into training and testing sets to evaluate model performance fairly.
Preprocess text to improve model accuracy by normalizing and cleaning input.
Check model accuracy or other metrics to understand classification quality.
