How to Do Multiclass Classification in Python with sklearn
To do multiclass classification in Python, use sklearn classifiers like
LogisticRegression or RandomForestClassifier, which support multiple classes by default. Prepare your data as features and labels, fit the model with model.fit(X_train, y_train), and predict with model.predict(X_test).

Syntax
Here is the basic syntax to perform multiclass classification using sklearn:
- from sklearn.model_selection import train_test_split: Split data into training and testing sets.
- from sklearn.ensemble import RandomForestClassifier: Import a classifier that supports multiclass.
- model = RandomForestClassifier(): Create the model instance.
- model.fit(X_train, y_train): Train the model on the training data.
- y_pred = model.predict(X_test): Predict classes for the test data.
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create model
model = RandomForestClassifier()

# Train model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
```
Example
This example shows how to classify the famous Iris dataset into three flower species using a Random Forest classifier.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create and train model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
```
Output
Accuracy: 1.00
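Beyond hard class predictions, most sklearn classifiers also expose per-class probabilities through predict_proba, which returns one column per class. A quick sketch (training on the full Iris dataset for brevity, so this is illustrative rather than a proper evaluation):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train on the full Iris dataset just to illustrate predict_proba
iris = load_iris()
model = RandomForestClassifier(random_state=1)
model.fit(iris.data, iris.target)

# predict_proba returns one probability column per class
proba = model.predict_proba(iris.data[:1])
print(proba.shape)  # (1, 3): one row, three class probabilities
print(proba.sum())  # probabilities in a row sum to 1
```

This is handy when you need confidence scores or want to set custom decision thresholds per class.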
Common Pitfalls
Common mistakes when doing multiclass classification include:
- Using classifiers or solver settings that do not support multiclass natively.
- Not encoding string labels properly (use LabelEncoder).
- Confusing multiclass with multilabel classification.
- Skipping the train/test split, which hides overfitting and inflates your performance estimate.
Always check that your model supports multiclass and preprocess labels correctly.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Note: solver='liblinear' handles multiclass only via one-vs-rest;
# solvers like 'lbfgs' also support a true multinomial model
model = LogisticRegression(solver='lbfgs', max_iter=200)

# Encode string labels if needed
labels = ['cat', 'dog', 'mouse']
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(labels)
print(y_encoded)  # Output: [0 1 2]
```
Output
[0 1 2]
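To make the multiclass-vs-multilabel pitfall concrete: in multiclass classification each sample gets exactly one label (a 1-D target vector), while in multilabel classification each sample can carry several labels at once (a 2-D indicator matrix, built here with MultiLabelBinarizer as one illustration):

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Multiclass: each sample has exactly ONE label -> 1-D target vector
y_multiclass = np.array([0, 2, 1, 0])
print(y_multiclass.shape)  # (4,)

# Multilabel: each sample may have SEVERAL labels -> 2-D indicator matrix
tags = [{'cat'}, {'cat', 'dog'}, {'mouse'}]
mlb = MultiLabelBinarizer()
y_multilabel = mlb.fit_transform(tags)
print(y_multilabel)
# [[1 0 0]
#  [1 1 0]
#  [0 0 1]]
```

Standard multiclass estimators expect the 1-D form; passing an indicator matrix by mistake is a common source of shape errors.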
Quick Reference
Tips for multiclass classification in sklearn:
- Use classifiers like RandomForestClassifier, LogisticRegression (with a suitable solver), or SVC (multiclass by default; decision_function_shape='ovr' controls the shape of its decision function).
- Split data into train and test sets to evaluate performance.
- Encode string labels with LabelEncoder before training.
- Use accuracy_score or classification_report to check results.
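Where accuracy gives a single number, classification_report breaks results down per class, which matters when classes are imbalanced. A short sketch on the Iris split used earlier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Same split as the earlier example
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Per-class precision, recall, and F1 in one report
print(classification_report(y_test, model.predict(X_test),
                            target_names=iris.target_names))
```

The report lists precision, recall, F1, and support for each of the three species, so a weak class can't hide behind a high overall accuracy.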
Key Takeaways
- Use sklearn classifiers that support multiclass classification by default.
- Always split your data into training and testing sets so evaluation reflects unseen data.
- Encode string labels with LabelEncoder before training your model.
- Choose a solver for LogisticRegression that handles multiclass tasks.
- Evaluate your model using accuracy or classification reports.