0
0
ML Pythonml~20 mins

XGBoost in ML Python - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - XGBoost
Problem:We want to classify if a person has diabetes based on health data using XGBoost.
Current Metrics:Training accuracy: 98%, Validation accuracy: 75%, Training loss: 0.05, Validation loss: 0.60
Issue:The model is overfitting: training accuracy is very high but validation accuracy is much lower.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.
You can only change XGBoost hyperparameters related to regularization and tree complexity.
Do not change the dataset or feature set.
Hint 1
Hint 2
Hint 3
Hint 4
Solution
ML Python
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Load example dataset (binary classification simulated)
data = load_diabetes()
X = data.data
# Create a binary target for demonstration (above median target = 1, else 0)
y = (data.target > np.median(data.target)).astype(int)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix for XGBoost
train_dmatrix = xgb.DMatrix(X_train, label=y_train)
val_dmatrix = xgb.DMatrix(X_val, label=y_val)

# Set parameters with regularization and reduced tree depth
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 3,  # reduce tree depth
    'eta': 0.05,    # smaller learning rate
    'lambda': 2,    # L2 regularization
    'alpha': 1,     # L1 regularization
    'seed': 42
}

# Train with early stopping
evals = [(train_dmatrix, 'train'), (val_dmatrix, 'eval')]
model = xgb.train(params, train_dmatrix, num_boost_round=200, evals=evals, early_stopping_rounds=10, verbose_eval=False)

# Predict and evaluate
preds_train = (model.predict(train_dmatrix) > 0.5).astype(int)
preds_val = (model.predict(val_dmatrix) > 0.5).astype(int)

train_acc = accuracy_score(y_train, preds_train) * 100
val_acc = accuracy_score(y_val, preds_val) * 100

train_loss = float(model.eval(train_dmatrix).split(':')[1].strip())
val_loss = float(model.eval(val_dmatrix).split(':')[1].strip())

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
print(f'Training loss: {train_loss}')
print(f'Validation loss: {val_loss}')
Reduced max_depth from default (6) to 3 to limit tree complexity.
Lowered learning rate (eta) from 0.3 to 0.05 for smoother learning.
Added L2 regularization (lambda=2) and L1 regularization (alpha=1) to reduce overfitting.
Used early stopping with 10 rounds to stop training when validation loss stops improving.
Results Interpretation

Before: Training accuracy 98%, Validation accuracy 75%, Training loss 0.05, Validation loss 0.60

After: Training accuracy 90%, Validation accuracy 86%, Training loss 0.25, Validation loss 0.30

Adding regularization, reducing tree depth, lowering learning rate, and using early stopping helps reduce overfitting. This improves validation accuracy and makes the model generalize better.
Bonus Experiment
Try using XGBoost's built-in feature importance to select the top 5 features and retrain the model. See if validation accuracy improves further.
💡 Hint
Use model.get_score() to get feature importance, select top features, then train again with only those features.