
How to Do Multiclass Text Classification in NLP Easily

To do multiclass text classification in NLP, you first convert text into numbers using techniques like TF-IDF or word embeddings. Then, you train a model such as Logistic Regression or Random Forest that can predict one label out of many classes based on the text features.

📐

Syntax

Multiclass text classification typically involves these steps:

Text Vectorization: Convert text into numeric features using CountVectorizer or TfidfVectorizer.
Model Training: Use a classifier like LogisticRegression, which fits a multinomial (softmax) model by default when there are more than two classes, or RandomForestClassifier.
Prediction: Use the trained model to predict the class label for new text.

python

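from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder data: replace with your own texts and labels
train_texts = ['cheap flights to Paris', 'new laptop review', 'easy pasta recipe']
train_labels = ['travel', 'tech', 'food']
test_texts = ['best budget laptop']

# Step 1: Vectorize text (learn the vocabulary from the training set only)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Step 2: Train model
# The default lbfgs solver fits a multinomial model when there are 3+ classes;
# the multi_class argument is deprecated in recent scikit-learn releases
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train, train_labels)

# Step 3: Predict (reuse the fitted vectorizer; do not fit it again)
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)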
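The same three steps can also be bundled into a single scikit-learn Pipeline, so the vectorizer and the classifier are always fitted and applied together. This is a minimal sketch; the texts and the labels (sports, business, tech) are hypothetical placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical placeholder data, one label per text
train_texts = [
    "goal scored in the final minute",
    "stocks fell sharply today",
    "new phone released this week",
    "team wins the championship",
    "market rallies after earnings",
    "chip maker unveils faster processor",
]
train_labels = ["sports", "business", "tech", "sports", "business", "tech"]

# Vectorizer and classifier chained into one estimator
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

# The pipeline accepts raw strings and vectorizes them internally
prediction = clf.predict(["the team scored a late goal"])
print(prediction)
```

A pipeline removes the risk of accidentally fitting the vectorizer on test data, because `predict` only ever calls `transform` on it.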

💻

Example

This example shows how to classify movie reviews into three categories: positive, neutral, and negative using scikit-learn. It uses TfidfVectorizer to convert text and LogisticRegression for classification.

python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample data
texts = [
    'I love this movie, it is fantastic!',
    'This film is okay, not great but not bad.',
    'I hated this movie, it was terrible.',
    'What a wonderful experience, truly amazing!',
    'Mediocre plot and average acting.',
    'Worst movie I have ever seen.',
    'Pretty good, I enjoyed it.',
    'Not my taste, but it was fine.',
    'Awful, do not waste your time.',
    'Excellent story and great characters.'
]
labels = ['positive', 'neutral', 'negative', 'positive', 'neutral', 'negative', 'positive', 'neutral', 'negative', 'positive']

# Split data (stratify keeps every class represented in the small test set)
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42, stratify=labels)

# Vectorize text
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

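# Train model (lbfgs fits a multinomial model for 3+ classes by default;
# the multi_class argument is deprecated in recent scikit-learn releases)
model = LogisticRegression(solver='lbfgs', max_iter=1000)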
model.fit(X_train_vec, y_train)

# Predict
y_pred = model.predict(X_test_vec)

# Show results
print(classification_report(y_test, y_pred))

Output

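              precision    recall  f1-score   support

    negative       1.00      1.00      1.00         1
     neutral       1.00      1.00      1.00         1
    positive       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3

With only three test reviews, these scores are illustrative. The exact values depend on how the data is split and on the scikit-learn version.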
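Beyond a single predicted label, it is often useful to see how confident the model is in each class. The sketch below refits a model on a subset of the reviews above and calls `predict_proba` on a new sentence; the new sentence is a hypothetical input, and no particular probabilities are claimed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "I love this movie, it is fantastic!",
    "This film is okay, not great but not bad.",
    "I hated this movie, it was terrible.",
    "What a wonderful experience, truly amazing!",
    "Mediocre plot and average acting.",
    "Worst movie I have ever seen.",
]
labels = ["positive", "neutral", "negative", "positive", "neutral", "negative"]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(texts), labels)

# Hypothetical unseen review
new_texts = ["An amazing, wonderful film."]
probs = model.predict_proba(vectorizer.transform(new_texts))

# Columns follow model.classes_ (sorted label order); each row sums to 1
for label, p in zip(model.classes_, probs[0]):
    print(f"{label}: {p:.2f}")
```

The column order always matches `model.classes_`, so read the probabilities against that attribute rather than against the order in which labels first appeared.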

⚠️

Common Pitfalls

Common mistakes when doing multiclass text classification include:

Skipping text cleanup such as removing stopwords or stripping noise, which can reduce model accuracy. Note that TfidfVectorizer already lowercases text by default.
Carrying over binary-only assumptions, such as a single decision threshold or metrics computed with average='binary', which causes errors or wrong predictions.
Ignoring class imbalance, which can bias the model toward majority classes.
Leaving logistic regression parameters like max_iter or solver at values that do not suit the data, so the solver stops before converging.

python

from sklearn.linear_model import LogisticRegression

# Risky: default settings handle 3+ labels automatically, but the default
# max_iter=100 may stop before the solver converges (ConvergenceWarning)
model_wrong = LogisticRegression()

# Better: raise max_iter and reweight classes if the data is imbalanced
# (the multi_class argument is deprecated in recent scikit-learn releases)
model_right = LogisticRegression(solver='lbfgs', max_iter=1000, class_weight='balanced')
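To see how class imbalance can be handled, the sketch below computes the weights scikit-learn assigns under the "balanced" heuristic, n_samples / (n_classes * class_count). The label counts are a hypothetical imbalanced set, not data from this article.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 6 positive, 3 neutral, 1 negative
labels = np.array(["positive"] * 6 + ["neutral"] * 3 + ["negative"])
classes = np.unique(labels)  # sorted: negative, neutral, positive

weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)

for c, w in zip(classes, weights):
    print(f"{c}: {w:.2f}")
# negative: 3.33
# neutral: 1.11
# positive: 0.56
```

Passing `class_weight='balanced'` to a classifier such as LogisticRegression applies this same weighting during training, so errors on rare classes count for more.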

📊

Quick Reference

Tips for multiclass text classification:

Use TfidfVectorizer or CountVectorizer to convert text to numbers.
Choose classifiers that support multiclass, such as LogisticRegression, RandomForestClassifier, or MultinomialNB. All three handle more than two classes without extra configuration.
Split data into train/test sets to evaluate performance.
Use metrics like accuracy, precision, recall, and f1-score for all classes.
Preprocess text by lowercasing and removing noise for better results.
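As a sketch of the metrics tip, the snippet below computes accuracy, per-class F1, and macro-averaged F1. The true and predicted labels are hypothetical. Note that `f1_score` defaults to `average='binary'`, which raises an error on multiclass labels, so an averaging mode must be chosen explicitly.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted labels (one neutral review misclassified)
y_true = ["positive", "neutral", "negative", "positive", "neutral", "negative"]
y_pred = ["positive", "neutral", "negative", "positive", "negative", "negative"]

class_names = ["negative", "neutral", "positive"]

accuracy = accuracy_score(y_true, y_pred)
per_class_f1 = f1_score(y_true, y_pred, labels=class_names, average=None)
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.83
for name, score in zip(class_names, per_class_f1):
    print(f"F1 {name}: {score:.2f}")  # 0.80, 0.67, 1.00
print(f"macro F1: {macro_f1:.2f}")  # macro F1: 0.82
```

Macro averaging weights every class equally, which makes weak performance on a small class visible even when overall accuracy looks good.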

✅

Key Takeaways

Convert text into numeric features using vectorizers like TF-IDF before training.

Use classifiers that support multiclass, such as multinomial (softmax) Logistic Regression.

Always preprocess text and handle class imbalance for better model accuracy.

Evaluate your model with metrics that consider all classes, like f1-score per class.

Set proper model parameters to ensure training convergence and good performance.