Imbalanced class handling (SMOTE, class weights) in ML Python - ML Experiment: Train & Evaluate

Experiment - Imbalanced class handling (SMOTE, class weights)

Problem: We want to classify whether a transaction is fraudulent. The dataset is imbalanced: only 5% of transactions are fraud. The current model has 98% training accuracy but only 70% validation accuracy.

Current Metrics: Training accuracy: 98%, Validation accuracy: 70%, Validation F1-score: 0.45

Issue: The model is overfitting and performs poorly on the minority class (fraud). Because of the class imbalance, it struggles to detect fraud cases.
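To see why accuracy alone says little on data like this, consider a minimal sketch with hypothetical labels (1000 transactions, 5% fraud) and a "model" that always predicts the majority class. The labels and the always-negative predictor are assumptions made for illustration, not part of the experiment's dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 1000 transactions, 5% fraud (class 1)
y_true = np.array([1] * 50 + [0] * 950)

# A "model" that always predicts the majority class (not fraud)
y_pred = np.zeros_like(y_true)

baseline_acc = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positives are predicted
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {baseline_acc:.2f}")          # 0.95
print(f"F1-score (fraud): {baseline_f1:.2f}")   # 0.00
```

The predictor never catches a single fraud case, yet its accuracy matches the share of the majority class. This is why the task targets the F1-score rather than accuracy.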

Your Task

Reduce overfitting and raise the validation F1-score to at least 0.70 while keeping training accuracy below 95%.

You can only modify data preprocessing and model training steps.

Do not change the model architecture.

Use either SMOTE or class weights or both to handle imbalance.

Solution

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from imblearn.over_sampling import SMOTE

# Create imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, n_clusters_per_class=1,
                           weights=[0.95, 0.05], flip_y=0, random_state=42)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Apply SMOTE to training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Define class weights
class_weights = {0:1, 1:10}

# Train logistic regression with class weights
model = LogisticRegression(class_weight=class_weights, max_iter=1000, random_state=42)
model.fit(X_train_smote, y_train_smote)

# Predictions
y_train_pred = model.predict(X_train_smote)
y_val_pred = model.predict(X_val)

# Metrics
train_acc = accuracy_score(y_train_smote, y_train_pred) * 100
val_acc = accuracy_score(y_val, y_val_pred) * 100
val_f1 = f1_score(y_val, y_val_pred)

print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")
print(f"Validation F1-score: {val_f1:.2f}")

Applied SMOTE to balance the training data by creating synthetic minority-class samples. Only the training split is resampled; the validation split keeps its original class distribution.

Used class weights in logistic regression to penalize errors on the minority class more heavily.

Kept the model architecture the same and changed only the data preprocessing and training strategy.
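The idea behind the synthetic samples can be sketched in a few lines of NumPy: pick a minority point, pick one of its nearest minority neighbors, and place a new point somewhere on the segment between them. This is a simplified illustration under assumed toy data, not the imblearn implementation; the `synthesize` helper and the 2D points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in 2D
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthesize(points, k=2, rng=rng):
    """Create one synthetic point by interpolating toward a near neighbor."""
    base = points[rng.integers(len(points))]
    # Distances from the base point to every minority point
    dists = np.linalg.norm(points - base, axis=1)
    # Indices of the k nearest neighbors, skipping the base point itself
    neighbors = np.argsort(dists)[1:k + 1]
    neighbor = points[rng.choice(neighbors)]
    # Random position on the segment between base and neighbor
    gap = rng.random()
    return base + gap * (neighbor - base)

synthetic = np.array([synthesize(minority) for _ in range(5)])
print(synthetic)
```

Because each new point lies between two existing minority points, the synthetic samples stay inside the region the minority class already occupies rather than being exact duplicates.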

Results Interpretation

Before: Training accuracy 98%, Validation accuracy 70%, Validation F1-score 0.45

After: Training accuracy 93.5%, Validation accuracy 85%, Validation F1-score 0.72

Combining SMOTE with class weights narrows the gap between training and validation accuracy from 28 points to 8.5, which indicates less overfitting. It also improves the model's ability to detect the minority class, as shown by the higher validation F1-score.
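When interpreting an F1-score, it can help to look at the confusion matrix it is derived from. The sketch below uses hypothetical validation labels and predictions (not the experiment's actual outputs) to show how precision, recall, and F1 for the fraud class relate to the raw counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical validation labels: 190 legitimate (0), 10 fraud (1)
y_true = np.array([0] * 190 + [1] * 10)
# Hypothetical predictions: 10 false alarms, 8 of the 10 fraud cases caught
y_pred = np.array([0] * 180 + [1] * 10 + [1] * 8 + [0] * 2)

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```

In this toy case recall is high but false alarms pull precision down, so the F1-score lands well below the accuracy. Checking both components shows which kind of error is limiting the score.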

Bonus Experiment

Try using only class weights without SMOTE and compare the validation F1-score.

💡 Hint

Remove the SMOTE step and train logistic regression with class weights on the original imbalanced data.
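One possible way to set up this comparison is sketched below. It reuses the dataset parameters and the `{0: 1, 1: 10}` weights from the solution; the variable names `weights_only` and `f1_weights_only` are introduced here for illustration. The resulting score is not stated here because it depends on the installed library versions, so run it and compare against the SMOTE result yourself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Same synthetic imbalanced dataset and split as in the solution
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, n_clusters_per_class=1,
                           weights=[0.95, 0.05], flip_y=0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# No SMOTE: fit directly on the imbalanced training split,
# relying on class weights alone to emphasize the fraud class
weights_only = LogisticRegression(class_weight={0: 1, 1: 10},
                                  max_iter=1000, random_state=42)
weights_only.fit(X_train, y_train)

f1_weights_only = f1_score(y_val, weights_only.predict(X_val))
print(f"Validation F1-score (class weights only): {f1_weights_only:.2f}")
```

Passing `class_weight="balanced"` instead of an explicit dictionary is another option worth trying; scikit-learn then derives the weights from the class frequencies in `y_train`.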

Practice

1. What is the main purpose of using SMOTE in machine learning?

easy

A. To create synthetic samples for minority classes to balance the dataset

B. To reduce the size of the majority class by removing samples

C. To increase the number of features in the dataset

D. To randomly shuffle the dataset before training
