MLOps · How-To · Beginner · 4 min read

How to Do Binary Classification in Python with sklearn

To do binary classification in Python, use sklearn to load data, split it into training and test sets, train a classifier such as LogisticRegression, and predict labels. Evaluate the model with metrics such as accuracy or a confusion matrix.
📐 Syntax

Binary classification in sklearn typically follows these steps:

  • Import the classifier, e.g., LogisticRegression.
  • Prepare your data (features and labels).
  • Split data into training and testing sets.
  • Create and train the model with fit().
  • Make predictions with predict().
  • Evaluate results using metrics like accuracy.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Example syntax (assumes X and y are your feature matrix and label vector)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
```
💻 Example

This example performs binary classification on the Iris dataset by predicting whether or not a flower is Iris virginica.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load Iris dataset
iris = load_iris()
X = iris.data
# Create binary target: 1 if Iris-Virginica, else 0
y = (iris.target == 2).astype(int)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create and train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {acc:.2f}")
print("Confusion Matrix:")
print(cm)
```
Output

```
Accuracy: 1.00
Confusion Matrix:
[[16  0]
 [ 0 13]]
```
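Beyond accuracy and the confusion matrix, sklearn's classification_report summarizes per-class precision, recall, and F1 in one call. A short sketch continuing the same Iris-Virginica setup (the target_names labels are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Same setup as the example above: Iris-Virginica (1) vs. the rest (0)
iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# One table with precision, recall, F1, and support per class
print(classification_report(y_test, model.predict(X_test),
                            target_names=["other", "virginica"]))
```

Per-class metrics matter in binary problems because accuracy alone can look good while one class is predicted poorly.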
⚠️ Common Pitfalls

Common mistakes when doing binary classification include:

  • Not splitting data properly, leading to overly optimistic results.
  • Feeding unscaled features to models that are sensitive to feature scale (e.g., SVC).
  • Ignoring class imbalance, which can bias the model toward the majority class.
  • Confusing predict_proba() (probabilities) with predict() (hard labels).

Always check your data and evaluate with multiple metrics.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Same binary target as above: Iris-Virginica vs. the rest
iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)

# Wrong: training and evaluating on the same data (no split)
model = LogisticRegression(max_iter=200)
model.fit(X, y)
y_pred = model.predict(X)
print(f"Accuracy without split: {accuracy_score(y, y_pred):.2f}")

# Right: hold out a test set first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy with split: {accuracy_score(y_test, y_pred):.2f}")
```
Output

```
Accuracy without split: 1.00
Accuracy with split: 1.00
```
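The last pitfall above is worth spelling out: predict() returns hard 0/1 labels, while predict_proba() returns per-class probabilities you can threshold yourself. A minimal sketch on the same Iris setup:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)

labels = model.predict(X_test)        # hard 0/1 labels
proba = model.predict_proba(X_test)   # shape (n_samples, 2): P(class 0), P(class 1)

# For binary logistic regression, predict() matches thresholding P(class 1) at 0.5
custom = (proba[:, 1] >= 0.5).astype(int)
print(np.array_equal(labels, custom))
```

Keeping the probabilities lets you raise or lower the 0.5 threshold when false positives and false negatives have different costs.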
📊 Quick Reference

Tips for binary classification with sklearn:

  • Use train_test_split to avoid overfitting.
  • Try classifiers such as LogisticRegression, RandomForestClassifier, or SVC.
  • Evaluate with accuracy_score, confusion_matrix, or roc_auc_score.
  • Scale features if needed using StandardScaler.
  • Handle imbalanced classes with techniques like class_weight='balanced'.
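The scaling and imbalance tips above combine naturally in a Pipeline, which fits the scaler on the training fold only and reapplies it at predict time. A sketch (class_weight='balanced' reweights classes inversely to their frequency; the Iris binary target is 50 positives vs. 100 negatives):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)  # 1:2 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# StandardScaler is fit only on the training data inside the pipeline;
# class_weight='balanced' counteracts the class imbalance
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=200, class_weight="balanced"),
)
clf.fit(X_train, y_train)

# ROC AUC scores the ranking of probabilities, so it is threshold-independent
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"ROC AUC: {auc:.2f}")
```

Note the stratify=y argument, which keeps the class ratio the same in both splits, another guard against the splitting pitfall above.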

Key Takeaways

  • Always split your data into training and testing sets before training.
  • Use sklearn classifiers like LogisticRegression for simple binary classification.
  • Evaluate your model with accuracy and a confusion matrix to understand performance.
  • Watch data preprocessing and class imbalance to improve results.
  • Use predict() for labels and predict_proba() for probabilities.