How to Do Multiclass Classification in Python with sklearn
To do multiclass classification in Python, use sklearn classifiers like
LogisticRegression or RandomForestClassifier, which support multiple classes by default. Prepare your data as features and labels, fit the model with model.fit(X_train, y_train), and predict with model.predict(X_test).

Syntax
Here is the basic syntax to perform multiclass classification using sklearn:
- from sklearn.model_selection import train_test_split: Split data into training and testing sets.
- from sklearn.ensemble import RandomForestClassifier: Import a classifier that supports multiclass.
- model = RandomForestClassifier(): Create the model instance.
- model.fit(X_train, y_train): Train the model on the training data.
- y_pred = model.predict(X_test): Predict classes for the test data.
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create model
model = RandomForestClassifier()

# Train model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
```
Example
This example shows how to classify the famous Iris dataset into three flower species using a Random Forest classifier.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create and train model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
```
Output
Accuracy: 1.00
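Beyond hard class predictions, most sklearn classifiers also expose per-class probabilities through predict_proba, which returns one column per class. A quick sketch (training on the full Iris dataset for brevity, so this is illustrative rather than a proper evaluation):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train on the full Iris dataset just to illustrate predict_proba
iris = load_iris()
model = RandomForestClassifier(random_state=1)
model.fit(iris.data, iris.target)

# predict_proba returns one probability column per class
proba = model.predict_proba(iris.data[:1])
print(proba.shape)  # (1, 3): one row, three class probabilities
print(proba.sum())  # probabilities in a row sum to 1
```

This is handy when you need confidence scores or want to set custom decision thresholds per class.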
Common Pitfalls
Common mistakes when doing multiclass classification include:
- Using classifiers or solver settings that do not support multiclass natively.
- Not encoding string labels properly (use LabelEncoder).
- Confusing multiclass with multilabel classification.
- Skipping the train/test split, which hides overfitting and inflates your performance estimate.
Always check that your model supports multiclass and preprocess labels correctly.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Note: solver='liblinear' handles multiclass only via one-vs-rest;
# solvers like 'lbfgs' also support a true multinomial model
model = LogisticRegression(solver='lbfgs', max_iter=200)

# Encode string labels if needed
labels = ['cat', 'dog', 'mouse']
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(labels)
print(y_encoded)  # Output: [0 1 2]
```
Output
[0 1 2]
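To make the multiclass-vs-multilabel pitfall concrete: in multiclass classification each sample gets exactly one label (a 1-D target vector), while in multilabel classification each sample can carry several labels at once (a 2-D indicator matrix, built here with MultiLabelBinarizer as one illustration):

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Multiclass: each sample has exactly ONE label -> 1-D target vector
y_multiclass = np.array([0, 2, 1, 0])
print(y_multiclass.shape)  # (4,)

# Multilabel: each sample may have SEVERAL labels -> 2-D indicator matrix
tags = [{'cat'}, {'cat', 'dog'}, {'mouse'}]
mlb = MultiLabelBinarizer()
y_multilabel = mlb.fit_transform(tags)
print(y_multilabel)
# [[1 0 0]
#  [1 1 0]
#  [0 0 1]]
```

Standard multiclass estimators expect the 1-D form; passing an indicator matrix by mistake is a common source of shape errors.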
Quick Reference
Tips for multiclass classification in sklearn:
- Use classifiers like RandomForestClassifier, LogisticRegression (with a suitable solver), or SVC (multiclass by default; decision_function_shape='ovr' controls the shape of its decision function).
- Split data into train and test sets to evaluate performance.
- Encode string labels with LabelEncoder before training.
- Use accuracy_score or classification_report to check results.
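Where accuracy gives a single number, classification_report breaks results down per class, which matters when classes are imbalanced. A short sketch on the Iris split used earlier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Same split as the earlier example
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Per-class precision, recall, and F1 in one report
print(classification_report(y_test, model.predict(X_test),
                            target_names=iris.target_names))
```

The report lists precision, recall, F1, and support for each of the three species, so a weak class can't hide behind a high overall accuracy.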
Key Takeaways
- Use sklearn classifiers that support multiclass classification by default.
- Always split your data into training and testing sets so evaluation reflects unseen data.
- Encode string labels with LabelEncoder before training your model.
- Choose a solver for LogisticRegression that handles multiclass tasks.
- Evaluate your model using accuracy or classification reports.