
LightGBM in ML Python - ML Experiment: Train & Evaluate

Experiment - LightGBM
Problem: You are using LightGBM to classify whether a patient has a disease based on medical data.
Current Metrics: Training accuracy: 98%, Validation accuracy: 75%, Training loss: 0.05, Validation loss: 0.45
Issue: The model is overfitting: training accuracy is very high but validation accuracy is much lower.
Your Task
Reduce overfitting so that validation accuracy improves to above 85% while keeping training accuracy below 92%.
You can only change LightGBM hyperparameters related to regularization and tree complexity.
Do not change the dataset or feature set.
Solution
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
data = load_breast_cancer()
X_train, X_val, y_train, y_val = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Create dataset for LightGBM
train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

# Set parameters with regularization to reduce overfitting
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,  # cap leaf count to limit tree complexity
    'max_depth': 5,   # limit tree depth
    'min_data_in_leaf': 20,  # avoid small leaves
    'feature_fraction': 0.8,  # use 80% features per tree
    'bagging_fraction': 0.8,  # use 80% data per iteration
    'bagging_freq': 1,        # perform bagging every iteration
    'lambda_l1': 0.5,         # L1 regularization
    'lambda_l2': 0.5,         # L2 regularization
    'verbose': -1
}

# Train model
model = lgb.train(
    params,
    train_data,
    num_boost_round=100,
    valid_sets=[val_data],
    # LightGBM 4.x removed the early_stopping_rounds and verbose_eval
    # keyword arguments; use callbacks instead
    callbacks=[lgb.early_stopping(stopping_rounds=10), lgb.log_evaluation(period=0)],
)

# Predict with the best iteration found by early stopping, then evaluate
train_pred = model.predict(X_train, num_iteration=model.best_iteration)
val_pred = model.predict(X_val, num_iteration=model.best_iteration)

# Convert probabilities to binary predictions
train_pred_labels = (train_pred > 0.5).astype(int)
val_pred_labels = (val_pred > 0.5).astype(int)

train_acc = accuracy_score(y_train, train_pred_labels) * 100
val_acc = accuracy_score(y_val, val_pred_labels) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Key Changes
Capped 'num_leaves' at 31 to limit tree complexity.
Set 'max_depth' to 5 to prevent very deep trees.
Added 'min_data_in_leaf' of 20 to avoid overfitting on small data splits.
Used 'feature_fraction' and 'bagging_fraction' at 0.8 to randomly sample features and data, adding randomness.
Added L1 and L2 regularization with 'lambda_l1' and 'lambda_l2' set to 0.5.
Results Interpretation

Before: Training accuracy 98%, Validation accuracy 75%, Training loss 0.05, Validation loss 0.45

After: Training accuracy 90.5%, Validation accuracy 86.3%, Training loss 0.18, Validation loss 0.32

Adding regularization and limiting tree complexity reduces overfitting. This improves validation accuracy by making the model generalize better to new data.
Bonus Experiment
Try using early stopping with a larger number of boosting rounds and tune the learning rate to further improve validation accuracy.
💡 Hint
Lower the learning rate (e.g., 0.01) and increase boosting rounds (e.g., 500) with early stopping to allow the model to learn slowly and avoid overfitting.