How to Do Multiclass Text Classification in NLP Easily
To do multiclass text classification in NLP, you first convert text into numeric features using techniques like TF-IDF or word embeddings. Then you train a model, such as Logistic Regression or Random Forest, that predicts exactly one label out of several classes from those features.
Syntax
Multiclass text classification typically involves these steps:
- Text Vectorization: Convert text into numeric features using CountVectorizer or TfidfVectorizer.
- Model Training: Use a classifier like LogisticRegression (multinomial by default with the lbfgs solver; older scikit-learn versions set this with multi_class='multinomial') or RandomForestClassifier.
- Prediction: Use the trained model to predict the class label for new text.
python
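from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# train_texts, train_labels, and test_texts are placeholders for your own data

# Step 1: Vectorize text (learn the vocabulary from the training data only)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Step 2: Train model
# With the lbfgs solver, LogisticRegression fits a multinomial model for
# multiclass labels by default; multi_class is deprecated since scikit-learn 1.5
model = LogisticRegression(max_iter=1000, solver='lbfgs')
model.fit(X_train, train_labels)

# Step 3: Predict (reuse the fitted vectorizer, do not refit it on test data)
X_test = vectorizer.transform(test_texts)
predictions = model.predict(X_test)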
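The three steps above can also be chained into a single object so that the vectorizer and the classifier are always fitted and applied together. The sketch below is one possible way to do this with scikit-learn's make_pipeline; the sample texts and the label names (praise, complaint, question) are invented for illustration and do not come from the original example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, invented for this sketch
train_texts = [
    "great phone with a long battery life",
    "the battery died after one day",
    "how do I reset my password",
    "love the camera quality",
    "screen cracked in a week",
    "where can I download the manual",
]
train_labels = ["praise", "complaint", "question", "praise", "complaint", "question"]

# Vectorization and classification become one estimator
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(train_texts, train_labels)

# Raw strings go in; the pipeline applies the fitted vectorizer internally
new_predictions = pipeline.predict(["the battery is great", "how do I reset it"])
print(list(new_predictions))
```

Because the pipeline accepts raw strings, there is less room to accidentally refit the vectorizer on test data or to forget the transform step before predicting.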
Example
This example shows how to classify movie reviews into three categories: positive, neutral, and negative using scikit-learn. It uses TfidfVectorizer to convert text and LogisticRegression for classification.
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample data
texts = [
    'I love this movie, it is fantastic!',
    'This film is okay, not great but not bad.',
    'I hated this movie, it was terrible.',
    'What a wonderful experience, truly amazing!',
    'Mediocre plot and average acting.',
    'Worst movie I have ever seen.',
    'Pretty good, I enjoyed it.',
    'Not my taste, but it was fine.',
    'Awful, do not waste your time.',
    'Excellent story and great characters.'
]
labels = ['positive', 'neutral', 'negative', 'positive', 'neutral',
          'negative', 'positive', 'neutral', 'negative', 'positive']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42)

# Vectorize text (fit on the training split only)
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train model (multinomial by default for multiclass labels with lbfgs;
# multi_class is deprecated since scikit-learn 1.5)
model = LogisticRegression(max_iter=1000, solver='lbfgs')
model.fit(X_train_vec, y_train)

# Predict
y_pred = model.predict(X_test_vec)

# Show results
print(classification_report(y_test, y_pred))
Output
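              precision    recall  f1-score   support

    negative       1.00      1.00      1.00         1
     neutral       1.00      1.00      1.00         1
    positive       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3

With only ten reviews and a three-review test set, the exact figures depend on how the data is split and should not be read as a measure of real-world accuracy.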
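Besides a single predicted label, it can help to look at the probability the model assigns to each class for a new text. The following is a minimal sketch using predict_proba, assuming the same ten sample reviews as above; the two new_reviews strings are made up for illustration, and the model is trained on all ten reviews here purely to keep the sketch short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    'I love this movie, it is fantastic!',
    'This film is okay, not great but not bad.',
    'I hated this movie, it was terrible.',
    'What a wonderful experience, truly amazing!',
    'Mediocre plot and average acting.',
    'Worst movie I have ever seen.',
    'Pretty good, I enjoyed it.',
    'Not my taste, but it was fine.',
    'Awful, do not waste your time.',
    'Excellent story and great characters.',
]
labels = ['positive', 'neutral', 'negative', 'positive', 'neutral',
          'negative', 'positive', 'neutral', 'negative', 'positive']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, labels)

# Hypothetical unseen reviews, invented for this sketch
new_reviews = ['An amazing and wonderful film', 'Terrible, the worst plot']
probabilities = model.predict_proba(vectorizer.transform(new_reviews))

# Columns of predict_proba follow the order of model.classes_
for review, row in zip(new_reviews, probabilities):
    best = model.classes_[np.argmax(row)]
    per_class = {str(c): round(float(p), 2) for c, p in zip(model.classes_, row)}
    print(f'{review!r} -> {best} {per_class}')
```

Each row of probabilities sums to 1, so a row with no clearly dominant class is a hint that the model is unsure about that text.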
Common Pitfalls
Common mistakes when doing multiclass text classification include:
- Not preprocessing text (lowercasing, removing stopwords or other noise), which can reduce model accuracy. TfidfVectorizer lowercases by default, but stopword removal has to be enabled through its stop_words parameter.
- Using a model or metric set up for binary labels when there are more than two classes, which causes errors or misleading results.
- Ignoring class imbalance, which can bias the model toward majority classes.
- Not tuning model parameters like max_iter or solver in logistic regression, leading to poor convergence.
python
from sklearn.linear_model import LogisticRegression

# Risky: the defaults already handle more than two labels, but max_iter=100
# can stop before convergence on large TF-IDF matrices (ConvergenceWarning)
model_default = LogisticRegression()

# Better: set the solver explicitly and give it room to converge
# (multinomial is the lbfgs default; multi_class is deprecated since scikit-learn 1.5)
model_tuned = LogisticRegression(solver='lbfgs', max_iter=1000)
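For the class imbalance pitfall, one common option in scikit-learn is class_weight='balanced', which weights each class inversely to its frequency. The sketch below is illustrative only: the ten texts and their 6/2/2 label split are invented to make the imbalance visible, and compute_class_weight is called just to show which weights the 'balanced' setting corresponds to.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced data: 6 positive, 2 neutral, 2 negative
texts = [
    'Loved it, fantastic acting',
    'A wonderful and moving story',
    'Great characters and a great ending',
    'Excellent direction, truly amazing',
    'Really enjoyed every minute',
    'Brilliant soundtrack and visuals',
    'It was okay, nothing special',
    'Average plot, fine for one viewing',
    'Terrible pacing and awful dialogue',
    'Worst film of the year',
]
labels = ['positive'] * 6 + ['neutral'] * 2 + ['negative'] * 2

# 'balanced' uses n_samples / (n_classes * class_count) for each class
classes = np.unique(labels)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=np.array(labels))
class_weights = {str(c): round(float(w), 2) for c, w in zip(classes, weights)}
print(class_weights)  # {'negative': 1.67, 'neutral': 1.67, 'positive': 0.56}

# Passing class_weight='balanced' applies the same weighting during training
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X, labels)
```

The rare classes receive roughly three times the weight of the majority class here, so errors on them cost more during training. Whether this improves results on a real dataset has to be checked with per-class metrics.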
Quick Reference
Tips for multiclass text classification:
- Use TfidfVectorizer or CountVectorizer to convert text to numbers.
- Choose classifiers that support multiclass, like LogisticRegression (multinomial by default with the lbfgs solver), RandomForestClassifier, or MultinomialNB.
- Split data into train/test sets to evaluate performance.
- Use metrics like accuracy, precision, recall, and f1-score for all classes.
- Preprocess text by lowercasing and removing noise for better results.
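Several of the tips above can be combined in one short script. The sketch below is an assumption-laden illustration rather than a benchmark: it uses twelve invented reviews, a stratified split, TfidfVectorizer with lowercasing and English stopword removal, and compares MultinomialNB with RandomForestClassifier on macro-averaged F1. The n_estimators and random_state values are arbitrary choices for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical data: four reviews per class, invented for this sketch
texts = [
    'Fantastic movie with a brilliant cast',
    'Loved the story and the music',
    'An amazing, heartfelt film',
    'Great pacing and wonderful acting',
    'It was fine, nothing memorable',
    'An average film with an okay plot',
    'Decent visuals but a forgettable story',
    'Neither good nor bad overall',
    'Terrible script and awful acting',
    'The worst movie this year',
    'Boring, slow, and badly edited',
    'Awful ending, a waste of time',
]
labels = ['positive'] * 4 + ['neutral'] * 4 + ['negative'] * 4

# stratify keeps one review of each class in the 3-review test set
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

candidates = {
    'MultinomialNB': MultinomialNB(),
    'RandomForestClassifier': RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, classifier in candidates.items():
    pipeline = make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words='english'),
        classifier,
    )
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    # Macro averaging gives every class equal weight in the final score
    scores[name] = f1_score(y_test, y_pred, average='macro', zero_division=0)
    print(f'{name}: macro F1 = {scores[name]:.2f}')
```

With a dataset this small the scores are noisy and say little about which classifier is better; the point is the workflow of swapping classifiers behind the same vectorizer and scoring them with a metric that covers all classes.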
Key Takeaways
Convert text into numeric features using vectorizers like TF-IDF before training.
Use classifiers that support multiclass, such as multinomial Logistic Regression.
Always preprocess text and handle class imbalance for better model accuracy.
Evaluate your model with metrics that consider all classes, like f1-score per class.
Set proper model parameters to ensure training convergence and good performance.
