How to Do Binary Classification in Python with sklearn
To do binary classification in Python, use sklearn to load data, split it, train a classifier such as LogisticRegression, and predict labels. Evaluate the model with metrics like accuracy or a confusion matrix.
Syntax
Binary classification in sklearn typically follows these steps:
- Import the classifier, e.g., LogisticRegression.
- Prepare your data (features and labels).
- Split data into training and testing sets.
- Create and train the model with fit().
- Make predictions with predict().
- Evaluate results using metrics like accuracy.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Example syntax
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
```
Example
This example shows how to do binary classification on the famous Iris dataset by classifying if a flower is Iris-Virginica or not.
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load Iris dataset
iris = load_iris()
X = iris.data

# Create binary target: 1 if Iris-Virginica, else 0
y = (iris.target == 2).astype(int)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create and train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")
print("Confusion Matrix:")
print(cm)
```
Output
Accuracy: 1.00
Confusion Matrix:
[[16 0]
[ 0 13]]
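The confusion matrix above is laid out with true negatives and false positives in the first row, and false negatives and true positives in the second. As a quick sketch of how to read it programmatically, the following uses made-up labels (the `y_true` and `y_pred` values are illustrative, not from the Iris example):

```python
from sklearn.metrics import confusion_matrix

# Illustrative toy labels (not from the Iris example above)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# For binary labels, ravel() flattens the 2x2 matrix into (TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=2 FP=1 FN=1 TP=2
```

Unpacking the matrix this way makes it easy to compute metrics such as precision (`tp / (tp + fp)`) by hand.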
Common Pitfalls
Common mistakes when doing binary classification include:
- Not splitting data properly, which hides overfitting.
- Using unscaled features with models that require scaling.
- Ignoring class imbalance, which can bias the model.
- Confusing predict_proba() with predict().
Always check your data and evaluate with multiple metrics.
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the same binary Iris target as in the example above
iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)

# Wrong: using all data for both training and evaluation (no split)
model = LogisticRegression(max_iter=200)
model.fit(X, y)
y_pred = model.predict(X)
print(f"Accuracy without split: {accuracy_score(y, y_pred):.2f}")

# Right: split the data first, then evaluate on held-out test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy with split: {accuracy_score(y_test, y_pred):.2f}")
```
Output
Accuracy without split: 1.00
Accuracy with split: 1.00
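The last pitfall in the list, confusing predict_proba() with predict(), is worth a quick sketch. The key difference: predict() returns hard 0/1 labels, while predict_proba() returns one probability per class, with column 1 holding the probability of the positive class:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# predict() gives hard labels (0 or 1)
labels = model.predict(X_test[:3])

# predict_proba() gives a (n_samples, n_classes) array of probabilities;
# each row sums to 1, and column 1 is P(class == 1)
probs = model.predict_proba(X_test[:3])
print(labels)
print(probs.shape)  # (3, 2)
```

Use the probabilities (not the hard labels) when you need a threshold other than 0.5 or when computing ranking metrics such as ROC AUC.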
Quick Reference
Tips for binary classification with sklearn:
- Use train_test_split so evaluation reflects performance on unseen data.
- Try LogisticRegression, RandomForestClassifier, or SVC for models.
- Evaluate with accuracy_score, confusion_matrix, or roc_auc_score.
- Scale features if needed using StandardScaler.
- Handle imbalanced classes with techniques like class_weight='balanced'.
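Several of these tips can be combined in one sketch: a pipeline that scales features with StandardScaler, trains a class-weight-balanced LogisticRegression, and evaluates with roc_auc_score (using the Iris example from earlier; the exact AUC value depends on the split):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Pipeline: scale features, then fit a classifier with balanced class weights
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=200, class_weight="balanced"),
)
pipe.fit(X_train, y_train)

# roc_auc_score expects probabilities of the positive class, not hard labels
probs = pipe.predict_proba(X_test)[:, 1]
print(f"ROC AUC: {roc_auc_score(y_test, probs):.2f}")
```

Using a pipeline ensures the scaler is fit only on the training data, so no information from the test set leaks into preprocessing.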
Key Takeaways
- Always split your data into training and testing sets before training.
- Use sklearn classifiers like LogisticRegression for simple binary classification.
- Evaluate your model with accuracy and a confusion matrix to understand performance.
- Be careful with data preprocessing and class imbalance to improve results.
- Use predict() for labels and predict_proba() for probabilities.