How to Do Multilabel Classification in Python with sklearn
Use sklearn.multioutput.MultiOutputClassifier or sklearn.multiclass.OneVsRestClassifier with a base classifier to handle multilabel classification in Python. Prepare your data with multiple target labels per sample, then fit and predict using these wrappers around classifiers like RandomForestClassifier.

Syntax
To do multilabel classification in Python with sklearn, wrap a base classifier with MultiOutputClassifier or OneVsRestClassifier. Then use fit(X, Y) where X is your features and Y is a 2D array of labels (one column per label). Use predict(X) to get multilabel predictions.
- MultiOutputClassifier(estimator): fits one classifier per label.
- OneVsRestClassifier(estimator): fits one binary classifier per label, treating samples without that label as negatives.
- estimator: any sklearn classifier, such as RandomForestClassifier().
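Before wrapping a classifier, it helps to see what a valid multilabel target looks like. A minimal sketch (the values here are made up for illustration):

```python
import numpy as np

# A multilabel target: one row per sample, one column per label,
# with 1 meaning the label applies to that sample.
Y = np.array([
    [1, 0, 1],   # sample 0 has labels 0 and 2
    [0, 1, 0],   # sample 1 has label 1 only
    [1, 1, 1],   # sample 2 has all three labels
])
print(Y.shape)  # (3, 3): 3 samples, 3 labels
```

This 2D indicator layout is what fit(X, Y) expects; a 1D array of single labels would make sklearn treat the problem as multiclass instead.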
```python
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

# Create multilabel classifier
model = MultiOutputClassifier(RandomForestClassifier())

# Fit with X (features) and Y (multilabel targets)
model.fit(X, Y)

# Predict multilabel outputs
predictions = model.predict(X_test)
```
Example
This example shows how to train and predict multilabel classification using MultiOutputClassifier with a random forest on a synthetic dataset.
```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, hamming_loss

# Generate synthetic multilabel data
X, Y = make_multilabel_classification(n_samples=100, n_features=5,
                                      n_classes=3, random_state=42)

# Split data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3,
                                                    random_state=42)

# Create multilabel classifier
model = MultiOutputClassifier(RandomForestClassifier(random_state=42))

# Train model
model.fit(X_train, Y_train)

# Predict multilabel targets
Y_pred = model.predict(X_test)

# Evaluate with exact-match accuracy and Hamming loss
acc = accuracy_score(Y_test, Y_pred)
hloss = hamming_loss(Y_test, Y_pred)
print(f"Accuracy (exact match): {acc:.2f}")
print(f"Hamming Loss: {hloss:.2f}")
```
Output
Accuracy (exact match): 0.37
Hamming Loss: 0.13
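Beyond hard 0/1 predictions, MultiOutputClassifier also exposes predict_proba, which returns one probability array per label rather than a single matrix. A short sketch reusing the same synthetic data:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

X, Y = make_multilabel_classification(n_samples=100, n_features=5,
                                      n_classes=3, random_state=42)
model = MultiOutputClassifier(RandomForestClassifier(random_state=42)).fit(X, Y)

# predict_proba returns a list with one array per label
probas = model.predict_proba(X[:5])
print(len(probas))      # 3: one entry per label
print(probas[0].shape)  # typically (5, 2): P(label absent), P(label present)
```

Per-label probabilities are useful when you want to tune a decision threshold per label instead of accepting the default 0.5 cutoff.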
Common Pitfalls
- Wrong target shape: Multilabel targets must be a 2D array with shape (samples, labels), not a 1D array.
- Using single-label classifiers directly: Most classifiers, such as LogisticRegression or SVC, expect single-label targets and raise an error on a 2D Y, so wrap them with MultiOutputClassifier or OneVsRestClassifier. (Tree-based models like RandomForestClassifier do accept multilabel targets natively, but wrapping keeps the interface consistent.)
- Confusing multilabel with multiclass: Multilabel means multiple labels per sample; multiclass means one label chosen from many classes.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# WRONG: most single-label classifiers reject 2D multilabel targets
model = LogisticRegression()
# model.fit(X, Y)  # Raises ValueError if Y has shape (n_samples, n_labels)

# RIGHT: wrap the base classifier with MultiOutputClassifier
model = MultiOutputClassifier(LogisticRegression())
model.fit(X, Y)
```
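If your raw labels arrive as variable-length lists of tags rather than a 2D array, sklearn's MultiLabelBinarizer converts them into the required indicator matrix. A small sketch with hypothetical tag names:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Raw labels as per-sample tag lists (tag names are made up)
raw = [["news", "sports"], ["tech"], ["news", "sports", "tech"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(raw)
print(mlb.classes_)  # ['news' 'sports' 'tech'] (sorted alphabetically)
print(Y)
# [[1 1 0]
#  [0 0 1]
#  [1 1 1]]
```

Keep the fitted binarizer around: mlb.inverse_transform(Y_pred) maps predictions back to tag lists.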
Quick Reference
Key points for multilabel classification in sklearn:
- Use MultiOutputClassifier or OneVsRestClassifier to extend single-label classifiers.
- Input targets must be 2D arrays with one column per label.
- Evaluate with metrics like hamming_loss or accuracy_score (exact match).
- Common base classifiers: RandomForestClassifier, LogisticRegression, SVC.
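The two metrics above measure different things, which a tiny hand-checkable example makes concrete (values below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
Y_pred = np.array([[1, 0, 0],   # one of three labels wrong
                   [0, 1, 0]])  # exact match

# Exact-match accuracy: fraction of samples where ALL labels are correct
print(accuracy_score(Y_true, Y_pred))  # 0.5
# Hamming loss: fraction of individual label entries that are wrong
print(hamming_loss(Y_true, Y_pred))   # 1 wrong out of 6 entries ≈ 0.167
```

This is why exact-match accuracy is usually much lower than the per-label picture suggests: a single wrong label zeroes out the whole sample.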
Key Takeaways
- Wrap single-label classifiers with MultiOutputClassifier or OneVsRestClassifier for multilabel tasks.
- Prepare multilabel targets as 2D arrays with one column per label.
- Use metrics like Hamming loss to evaluate multilabel classification performance.
- Avoid fitting single-label classifiers directly on multilabel data to prevent errors.
- RandomForestClassifier is a good base model for multilabel classification with sklearn.
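As a variation on the examples above, here is the same pattern with OneVsRestClassifier and LogisticRegression (a sketch; hyperparameters are illustrative):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import hamming_loss

X, Y = make_multilabel_classification(n_samples=200, n_features=10,
                                      n_classes=4, random_state=0)

# One binary logistic regression per label
model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(hamming_loss(Y, model.predict(X)))  # training-set Hamming loss
```

OneVsRestClassifier and MultiOutputClassifier behave very similarly for classification; OneVsRestClassifier is the traditional choice when each label is a simple binary presence/absence decision.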