MLOps · How-To · Beginner · 3 min read

How to Do Multilabel Classification in Python with sklearn

Use sklearn.multioutput.MultiOutputClassifier or sklearn.multiclass.OneVsRestClassifier with a base classifier to handle multilabel classification in Python. Prepare your data with multiple target labels per sample, then fit and predict using these wrappers around classifiers like RandomForestClassifier.
📐 Syntax

To do multilabel classification in Python with sklearn, wrap a base classifier with MultiOutputClassifier or OneVsRestClassifier. Then use fit(X, Y) where X is your features and Y is a 2D array of labels (one column per label). Use predict(X) to get multilabel predictions.

  • MultiOutputClassifier(estimator): Fits one classifier per label.
  • OneVsRestClassifier(estimator): Fits one classifier per label treating others as negative.
  • estimator: Any sklearn classifier like RandomForestClassifier().
python
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

# Create multilabel classifier
model = MultiOutputClassifier(RandomForestClassifier())

# Fit with X (features) and Y (multilabel targets)
model.fit(X, Y)

# Predict multilabel outputs
predictions = model.predict(X_test)
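OneVsRestClassifier follows the same fit/predict pattern. A minimal sketch on a synthetic dataset (the dataset parameters are illustrative, and max_iter=1000 is just a convergence safeguard for LogisticRegression):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Small synthetic multilabel dataset: 100 samples, 3 label columns
X, Y = make_multilabel_classification(n_samples=100, n_features=5, n_classes=3, random_state=42)

# One binary LogisticRegression is trained per label column
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(X, Y)

predictions = model.predict(X)
print(predictions.shape)  # (100, 3): one 0/1 column per label
```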
💻 Example

This example shows how to train and predict multilabel classification using MultiOutputClassifier with a random forest on a synthetic dataset.

python
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, hamming_loss

# Generate synthetic multilabel data
X, Y = make_multilabel_classification(n_samples=100, n_features=5, n_classes=3, random_state=42)

# Split data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Create multilabel classifier
model = MultiOutputClassifier(RandomForestClassifier(random_state=42))

# Train model
model.fit(X_train, Y_train)

# Predict multilabel targets
Y_pred = model.predict(X_test)

# Evaluate with accuracy per label and Hamming loss
acc = accuracy_score(Y_test, Y_pred)
hloss = hamming_loss(Y_test, Y_pred)

print(f"Accuracy (exact match): {acc:.2f}")
print(f"Hamming Loss: {hloss:.2f}")
Output

Accuracy (exact match): 0.37
Hamming Loss: 0.13
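The two metrics answer different questions: exact-match accuracy counts a sample as correct only when every label is predicted correctly, while Hamming loss averages mistakes over individual label slots. A tiny hand-checkable sketch:

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Two samples, three labels each
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
Y_pred = np.array([[1, 0, 0],   # one label wrong -> not an exact match
                   [0, 1, 0]])  # all labels right -> exact match

print(accuracy_score(Y_true, Y_pred))  # 0.5: 1 of 2 samples matches exactly
print(hamming_loss(Y_true, Y_pred))    # ~0.17: 1 of 6 label slots is wrong
```

This is why Hamming loss often looks much better than exact-match accuracy on the same predictions, as in the output above.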
⚠️ Common Pitfalls

  • Wrong target shape: Multilabel targets must be a 2D array with shape (samples, labels), not a 1D array.
  • Using single-label classifiers directly: Most classifiers (e.g. LogisticRegression, SVC) expect a 1D target and raise an error on a 2D multilabel Y, so wrap them with MultiOutputClassifier or OneVsRestClassifier. (RandomForestClassifier is an exception that accepts 2D targets natively, but wrapping still works and keeps your code uniform.)
  • Confusing multilabel with multiclass: Multilabel means multiple labels per sample; multiclass means one label from many classes.
python
from sklearn.linear_model import LogisticRegression

# WRONG: most single-label classifiers reject 2D multilabel targets
model = LogisticRegression()
# model.fit(X, Y)  # Raises ValueError: y should be a 1d array

# RIGHT: wrap with MultiOutputClassifier (works with any base classifier)
from sklearn.multioutput import MultiOutputClassifier
model = MultiOutputClassifier(LogisticRegression())
model.fit(X, Y)
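If your raw labels are variable-length sets per sample (tags, categories), sklearn's MultiLabelBinarizer converts them into the required 2D indicator array. A short sketch with made-up tag data:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical raw labels: each sample has a variable-length set of tags
raw = [["cat", "dog"], ["dog"], ["bird", "cat"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(raw)  # 2D indicator array, shape (n_samples, n_labels)

print(mlb.classes_)  # ['bird' 'cat' 'dog'] (sorted label order)
print(Y)
# [[0 1 1]
#  [0 0 1]
#  [1 1 0]]
```

The resulting Y is exactly the (samples, labels) shape that MultiOutputClassifier and OneVsRestClassifier expect.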
📊 Quick Reference

Key points for multilabel classification in sklearn:

  • Use MultiOutputClassifier or OneVsRestClassifier to extend single-label classifiers.
  • Input targets must be 2D arrays with one column per label.
  • Evaluate with metrics like hamming_loss or accuracy_score (exact match).
  • Common base classifiers: RandomForestClassifier, LogisticRegression, SVC.

Key Takeaways

  • Wrap single-label classifiers with MultiOutputClassifier or OneVsRestClassifier for multilabel tasks.
  • Prepare multilabel targets as 2D arrays with one column per label.
  • Use metrics like Hamming loss to evaluate multilabel classification performance.
  • Avoid using single-label classifiers directly on multilabel data to prevent errors.
  • RandomForestClassifier is a good base model for multilabel classification with sklearn.