How to Do Multilabel Text Classification in NLP
To do multilabel text classification in NLP, prepare text data with one or more labels per example, convert the text to features (such as TF-IDF vectors or embeddings), and train a model that supports multilabel outputs, such as LogisticRegression wrapped in OneVsRestClassifier or a neural network with sigmoid output activation. For neural networks, use a per-label loss such as binary_crossentropy, and evaluate with multilabel-aware metrics such as f1_score with micro or macro averaging.
Syntax
Here is the typical syntax pattern for multilabel text classification using scikit-learn:
- OneVsRestClassifier: Wraps a base classifier and handles multilabel targets by training one binary classifier per label.
- fit(X_train, y_train): Trains the model on text features and multilabel targets.
- predict(X_test): Predicts multilabel outputs for new text data.
- MultiLabelBinarizer: Converts lists of labels to the binary indicator format used for multilabel targets.
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["sample text one", "sample text two"]
labels = [["label1", "label2"], ["label2"]]

# Convert labels to binary format
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# Convert text to features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Create multilabel classifier
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))

# Train model
model.fit(X, y)

# Predict
predictions = model.predict(X)
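To make the binary indicator format concrete, the following minimal sketch inspects what MultiLabelBinarizer produces for the two label lists used above. It only uses the `classes_` attribute and the `inverse_transform` method of MultiLabelBinarizer; the variable names are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

labels = [["label1", "label2"], ["label2"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# classes_ holds the sorted label vocabulary; column i of y corresponds to classes_[i].
print(mlb.classes_)  # ['label1' 'label2']

# Each row is one example; a 1 means the label is attached to that example.
print(y)
# [[1 1]
#  [0 1]]

# inverse_transform maps indicator rows back to tuples of label names.
print(mlb.inverse_transform(y))  # [('label1', 'label2'), ('label2',)]
```

Keeping the fitted `mlb` object around matters, because it is the only record of which column corresponds to which label.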
Example
This example shows how to train a multilabel text classifier on sample data, predict labels, and evaluate performance using F1 score.
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Sample data
texts = [
    "I love programming in Python",
    "Python and Java are popular languages",
    "I enjoy machine learning and AI",
    "AI is the future of technology",
    "Java is used in many applications"
]
labels = [
    ["programming", "python"],
    ["programming", "java"],
    ["machine learning", "ai"],
    ["ai"],
    ["programming", "java"]
]

# Convert labels to binary format
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# Split data
X_train, X_test, y_train, y_test = train_test_split(texts, y, test_size=0.4, random_state=42)

# Vectorize text (fit on the training split only, then reuse on the test split)
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train multilabel classifier
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(X_train_vec, y_train)

# Predict
y_pred = model.predict(X_test_vec)

# Evaluate
f1 = f1_score(y_test, y_pred, average='micro')

print("Predicted labels:", mlb.inverse_transform(y_pred))
print(f"Micro F1 score: {f1:.2f}")
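Output
Predicted labels: [('ai',), ('programming', 'java')]
Micro F1 score: 1.00
Treat this output as illustrative. With only five examples, the predictions and score depend on which examples land in the training split, and a label that never appears in the training split cannot be predicted. Also note that mlb.inverse_transform returns the labels of each tuple in the sorted order of mlb.classes_, so "java" is listed before "programming".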
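The difference between micro and macro averaging is easier to see when computed by hand. The sketch below uses made-up indicator matrices (not the output of the example above) and checks the manual numbers against f1_score. It assumes every label has at least one true or predicted positive, so no division by zero occurs.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical indicator matrices: 3 examples x 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# Per-label true positives, false positives, and false negatives.
tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)

# Micro: pool the counts over all labels, then compute a single F1.
micro_manual = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

# Macro: compute one F1 per label, then take the unweighted mean.
per_label_f1 = 2 * tp / (2 * tp + fp + fn)
macro_manual = per_label_f1.mean()

micro_sklearn = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro_sklearn = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"micro: manual={micro_manual:.2f} sklearn={micro_sklearn:.2f}")  # 0.75 for both
print(f"macro: manual={macro_manual:.2f} sklearn={macro_sklearn:.2f}")  # 0.56 for both
```

Here the third label is never predicted, which pulls the macro score down much more than the micro score. Macro averaging is the more sensitive choice when rare labels matter.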
Common Pitfalls
Common mistakes when doing multilabel text classification include:
- Using single-label classifiers without a multilabel wrapper such as OneVsRestClassifier.
- Not converting label lists to binary indicator format with MultiLabelBinarizer.
- Using softmax activation or categorical cross-entropy loss, which assume exactly one label per example.
- Ignoring multilabel evaluation metrics like micro/macro F1 score.
python
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Wrong: Using LabelEncoder for multilabel (only works for single-label)
labels = [["python", "ai"], ["java"]]
le = LabelEncoder()
# y_wrong = le.fit_transform(labels)  # This will raise an error for multilabel lists

# Right: Use MultiLabelBinarizer for multilabel
labels_multi = [["python", "ai"], ["java"]]
mlb = MultiLabelBinarizer()
y_right = mlb.fit_transform(labels_multi)
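The softmax pitfall can be illustrated with a few lines of NumPy. This is a minimal sketch with made-up raw scores for one document and three labels. The 0.5 threshold is an assumption chosen for illustration, not something prescribed above.

```python
import numpy as np

# Hypothetical raw scores (logits) for three labels; the first two both apply.
logits = np.array([2.0, 2.0, -3.0])

# Softmax: labels compete for a total probability of 1.
exp = np.exp(logits - logits.max())
softmax = exp / exp.sum()

# Sigmoid: each label gets its own independent probability.
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(softmax.round(3))  # [0.498 0.498 0.003]
print(sigmoid.round(3))  # [0.881 0.881 0.047]

# With a 0.5 threshold, softmax selects no label while sigmoid selects the first two.
print(softmax >= 0.5)    # [False False False]
print(sigmoid >= 0.5)    # [ True  True False]
```

Because softmax forces the scores to sum to 1, two equally relevant labels split the probability mass and neither looks confident. Sigmoid scores do not interact, which is what a multilabel output needs.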
Quick Reference
- Use MultiLabelBinarizer to convert multilabel lists to binary arrays.
- Use OneVsRestClassifier with a base classifier like LogisticRegression for multilabel tasks.
- Vectorize text with TfidfVectorizer or embeddings.
- Use sigmoid activation and binary cross-entropy loss in neural networks.
- Evaluate with multilabel metrics like micro/macro F1 score.
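As a rough illustration of the neural-network item, the sketch below computes binary cross-entropy over sigmoid outputs in plain NumPy. It is not tied to any deep learning framework. The helper names `sigmoid` and `binary_cross_entropy` and the example scores are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Map raw scores to independent per-label probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, probs, eps=1e-12):
    """Mean binary cross-entropy over all examples and labels."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    per_label = -(y_true * np.log(probs) + (1.0 - y_true) * np.log(1.0 - probs))
    return float(per_label.mean())

# One example with three labels; the first two labels are present.
y_true = np.array([[1.0, 1.0, 0.0]])

good_logits = np.array([[3.0, 2.0, -3.0]])  # agrees with y_true on every label
bad_logits = np.array([[3.0, -2.0, -3.0]])  # misses the second label

loss_good = binary_cross_entropy(y_true, sigmoid(good_logits))
loss_bad = binary_cross_entropy(y_true, sigmoid(bad_logits))

print(f"loss when all labels match: {loss_good:.4f}")
print(f"loss when one label is missed: {loss_bad:.4f}")
```

Each label contributes its own term to the loss, so a missed label is penalized without affecting how the other labels are scored.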
Key Takeaways
Convert multilabel targets to binary format using MultiLabelBinarizer before training.
Wrap classifiers with OneVsRestClassifier to handle multilabel classification.
Use text vectorization methods like TF-IDF to convert text into features.
Choose evaluation metrics suited for multilabel tasks, such as micro or macro F1 score.
Avoid single-label assumptions like softmax activation or LabelEncoder for multilabel data.
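The takeaways can be combined into one object. The sketch below uses scikit-learn's Pipeline, which the examples above do not use, as one convenient way to keep the vectorizer and classifier together. The step names "tfidf" and "clf" and the new sentence are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "I love programming in Python",
    "Python and Java are popular languages",
    "I enjoy machine learning and AI",
    "AI is the future of technology",
    "Java is used in many applications",
]
labels = [
    ["programming", "python"],
    ["programming", "java"],
    ["machine learning", "ai"],
    ["ai"],
    ["programming", "java"],
]

# Targets still need to be binarized separately from the pipeline.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# The pipeline fits TF-IDF and the one-vs-rest classifier in a single call.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
pipeline.fit(texts, y)

# Raw strings go in; the pipeline applies the fitted vectorizer before predicting.
new_texts = ["Java and Python for machine learning"]
y_new = pipeline.predict(new_texts)
print(mlb.inverse_transform(y_new))
```

With a dataset this small the predicted labels are not meaningful. The point is only that raw text is passed to `fit` and `predict`, so the vectorizer cannot accidentally be refit on test data.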
