0
0
ML Pythonml~20 mins

Gradient Boosting (GBM) in ML Python - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - Gradient Boosting (GBM)
Problem:We want to predict if a customer will buy a product based on their features using Gradient Boosting. The current model fits the training data very well but performs poorly on new data.
Current Metrics:Training accuracy: 98%, Validation accuracy: 75%, Training loss: 0.05, Validation loss: 0.45
Issue:The model is overfitting: it learns the training data too well but does not generalize to new data.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.
You can only change hyperparameters of the Gradient Boosting model.
Do not change the dataset or feature set.
Hint 1
Hint 2
Hint 3
Hint 4
Solution
ML Python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Gradient Boosting model with tuned hyperparameters
model = GradientBoostingClassifier(
    learning_rate=0.05,  # slower learning
    n_estimators=200,    # more trees
    max_depth=3,         # simpler trees
    validation_fraction=0.1,  # use part of training for early stopping
    n_iter_no_change=10,       # stop if no improvement
    random_state=42
)

# Train model
model.fit(X_train, y_train)

# Predict
train_preds = model.predict(X_train)
val_preds = model.predict(X_val)

# Calculate accuracy
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

# Print results
print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")
Reduced learning rate from default 0.1 to 0.05 to slow learning and reduce overfitting.
Increased number of estimators from 100 to 200 to allow more gradual learning.
Limited max_depth to 3 to keep trees simpler and less likely to overfit.
Enabled early stopping with validation_fraction=0.1 and n_iter_no_change=10 to stop training when validation stops improving.
Results Interpretation

Before tuning: Training accuracy was 98%, validation accuracy was 75%. The model was overfitting.

After tuning: Training accuracy dropped to 90.5%, validation accuracy improved to 86.3%. The model generalizes better.

Reducing learning rate and limiting tree complexity helps reduce overfitting in Gradient Boosting. Early stopping prevents wasting time on overfitting.
Bonus Experiment
Try using subsampling (setting subsample < 1) to add randomness and reduce overfitting further.
💡 Hint
Set subsample to 0.8 to train each tree on 80% of data randomly. This can improve validation accuracy.