How to Do Text Classification in Python for NLP

To do text classification in Python for NLP, first convert the text into numeric features with a vectorizer such as TfidfVectorizer or CountVectorizer, then train a machine learning model such as LogisticRegression or MultinomialNB on labeled data. Finally, use the trained model to predict categories for new text.

Syntax

Text classification in Python typically involves these steps:

Text vectorization: Convert text to numeric features using CountVectorizer or TfidfVectorizer.
Model training: Use a classifier like LogisticRegression or MultinomialNB from sklearn.
Prediction: Apply the trained model to new text data to get predicted labels.

python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

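# Placeholder data for illustration; replace with your own labeled texts
train_texts = ['I love this movie', 'This film was terrible']
train_labels = [1, 0]  # 1=positive, 0=negative
test_texts = ['Best movie ever']

# Step 1: Convert text to features
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)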

# Step 2: Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)

# Step 3: Predict (reuse the fitted vectorizer: transform, not fit_transform)
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)
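The same three steps work with the other components named above. As a minimal sketch, the snippet below swaps in CountVectorizer and MultinomialNB; the sentences and labels are made-up placeholder data, not part of the original example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder data, invented for illustration (1=positive, 0=negative)
train_texts = ["great movie", "loved the acting", "boring plot", "terrible film"]
train_labels = [1, 1, 0, 0]
test_texts = ["great acting", "boring film"]

# Step 1: Convert text to word-count features
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Step 2: Train a Naive Bayes classifier on the counts
model = MultinomialNB()
model.fit(X_train, train_labels)

# Step 3: Reuse the fitted vectorizer, then predict
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)
print(predictions.tolist())  # [1, 0]
```

MultinomialNB expects non-negative features such as word counts, which is why it pairs naturally with CountVectorizer.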

Example

This example shows how to classify movie reviews as positive or negative using sklearn. It uses TfidfVectorizer to convert text and LogisticRegression to train the model.

python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
texts = [
    'I love this movie',
    'This film was terrible',
    'Amazing story and great acting',
    'I did not like the movie',
    'Best movie ever',
    'Worst film I have seen'
]
labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# Split data
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.33, random_state=42)

# Vectorize text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)

# Predict
predictions = model.predict(X_test)

# Evaluate
accuracy = accuracy_score(test_labels, predictions)
print(f'Accuracy: {accuracy:.2f}')
print('Predictions:', predictions.tolist())

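Output

Accuracy: 1.00
Predictions: [0, 1]

With only two test reviews, this score is not a meaningful measure of quality; the exact numbers depend on which reviews land in the test set.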
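Once a model is trained, the final step is to predict categories for text it has not seen. The sketch below retrains on the six reviews from the example and scores two new sentences; the new sentences are invented for illustration, and no particular prediction is claimed for them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Same six reviews as the example above (1=positive, 0=negative)
texts = [
    'I love this movie',
    'This film was terrible',
    'Amazing story and great acting',
    'I did not like the movie',
    'Best movie ever',
    'Worst film I have seen'
]
labels = [1, 0, 1, 0, 1, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(texts), labels)

# Hypothetical unseen reviews, invented for illustration
new_texts = ['What a great story', 'The worst acting I have seen']
X_new = vectorizer.transform(new_texts)  # transform only; never refit on new data

predicted = model.predict(X_new)
probabilities = model.predict_proba(X_new)  # columns follow model.classes_

for text, label, proba in zip(new_texts, predicted, probabilities):
    print(f'{text!r} -> label {label}, P(positive) = {proba[1]:.2f}')
```

Words that never appeared in the training data are ignored by transform, so a sentence made up entirely of unseen words falls back on the model's intercept.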

Common Pitfalls

Common mistakes when doing text classification include:

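Skipping text cleanup. CountVectorizer and TfidfVectorizer already lowercase text and drop punctuation by default, but noise such as HTML tags or URLs still ends up in the features and can reduce accuracy.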
Using the same data for training and testing, causing overly optimistic results.
Ignoring class imbalance, which can bias the model toward the majority class.
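Leaving model parameters and vectorizer settings (for example, the regularization strength C or ngram_range) at their defaults without checking whether tuning improves performance.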

python

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ['Good movie', 'Bad movie', 'Excellent film', 'Terrible film']
labels = [1, 0, 1, 0]

# Wrong: No train/test split
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)
predictions = model.predict(X)
print('Accuracy without split:', accuracy_score(labels, predictions))

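# Right: With train/test split (stratify keeps both classes in each half)
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.5, random_state=1, stratify=labels
)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)
predictions = model.predict(X_test)
print('Accuracy with split:', accuracy_score(test_labels, predictions))

Output

Accuracy without split: 1.0

The second line reports a much lower score (0.5 or 0.0, depending on which reviews land in the test set). The perfect score without a split only reflects memorized training data; four examples are far too few to generalize from.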
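Class imbalance, the third pitfall in the list, can be partly addressed by reweighting classes during training. The sketch below is one possible approach, assuming a deliberately imbalanced toy dataset invented for illustration; it uses the class_weight option of LogisticRegression and prints the weights that option implies.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Invented, deliberately imbalanced data: six negative reviews, two positive
texts = [
    'boring film', 'bad acting', 'terrible plot', 'awful movie',
    'dull story', 'weak ending', 'great movie', 'excellent film',
]
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# "balanced" weights are n_samples / (n_classes * class_count)
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=labels)
print(dict(zip([0, 1], weights.round(2).tolist())))  # {0: 0.67, 1: 2.0}

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# class_weight='balanced' applies those weights inside the training loss
model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X, labels)
```

When splitting imbalanced data, passing stratify=labels to train_test_split keeps the class proportions similar in the training and test sets.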

Quick Reference

Tips for effective text classification in Python:

Always split data into training and testing sets.
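Start with TfidfVectorizer; it is often a better representation than raw counts because it down-weights words that appear in most documents.
Try simple models like LogisticRegression or MultinomialNB first.
Clean noise such as HTML tags and URLs; the sklearn vectorizers already lowercase text by default.
Evaluate with accuracy, and add metrics such as F1-score when classes are imbalanced.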
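The cleaning tip can be sketched by passing a custom function to the vectorizer's preprocessor parameter. The helper clean_text and the sample strings below are hypothetical, written for this illustration. A custom preprocessor replaces the vectorizer's built-in lowercasing step, so the helper lowercases the text itself.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_text(text):
    """Hypothetical cleaning helper: lowercase, drop HTML tags and URLs."""
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)       # strip HTML tags
    text = re.sub(r'https?://\S+', ' ', text)  # strip URLs
    return re.sub(r'\s+', ' ', text).strip()   # collapse whitespace

# Invented raw inputs containing markup and a link
raw_texts = [
    '<p>GREAT movie!</p> See https://example.com/review',
    'Terrible film... <br> would not recommend',
]

vectorizer = TfidfVectorizer(preprocessor=clean_text)
X = vectorizer.fit_transform(raw_texts)
print(sorted(vectorizer.vocabulary_))
# ['film', 'great', 'movie', 'not', 'recommend', 'see', 'terrible', 'would']
```

Without the helper, fragments such as "br", "https", and "com" would become features of their own.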

Key Takeaways

Convert text to numeric features using vectorizers like TfidfVectorizer before training.

Train a simple classifier such as LogisticRegression on labeled text data.

Always split data into training and testing sets to evaluate model performance fairly.

Normalize and clean the input text; this often improves model accuracy.

Check model accuracy or other metrics to understand classification quality.
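To see why checking more than accuracy matters, consider a degenerate model that always predicts the majority class. The labels below are invented for illustration and do not come from any real model.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented ground truth: eight negative reviews, two positive
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A degenerate model that always predicts the majority class
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)
# F1 for the positive class; zero_division=0 avoids a warning
# when the model predicts no positives at all
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f'Accuracy: {accuracy:.2f}')  # Accuracy: 0.80
print(f'F1-score: {f1:.2f}')        # F1-score: 0.00
```

The accuracy looks acceptable, while the F1-score shows that the model never finds a single positive review.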