ML Python · ~20 mins

Gradient Boosting for regression in ML Python - ML Experiment: Train & Evaluate

Experiment - Gradient Boosting for regression
Problem: Predict house prices using a gradient boosting regression model.
Current Metrics: Training R2 score: 0.95, Validation R2 score: 0.70
Issue: The model is overfitting: the training score is very high, but the validation score is much lower.
Your Task
Reduce overfitting so that validation R2 score improves to at least 0.85 while keeping training R2 below 0.90.
Do not change the dataset or features.
Only adjust gradient boosting hyperparameters.
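Before tuning anything, it helps to reproduce the gap the task describes. The sketch below fits a deliberately flexible gradient boosting model with deep trees so the train/validation gap is visible; it uses a synthetic dataset from make_regression (an assumption, so the snippet runs standalone without downloading data), so the exact scores will differ from the 0.95/0.70 above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption, for illustration only)
X, y = make_regression(n_samples=500, n_features=8, noise=25.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Deliberately flexible model: deep trees with the default learning rate
baseline = GradientBoostingRegressor(max_depth=6, random_state=42)
baseline.fit(X_train, y_train)

train_r2 = r2_score(y_train, baseline.predict(X_train))
val_r2 = r2_score(y_val, baseline.predict(X_val))
print(f"Train R2: {train_r2:.2f}, Val R2: {val_r2:.2f}")  # expect a visible gap
```

Seeing the gap first makes it easier to judge whether each hyperparameter change below actually narrows it.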
Solution
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Load data
X, y = fetch_california_housing(return_X_y=True)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define model with adjusted hyperparameters
model = GradientBoostingRegressor(
    n_estimators=200,      # more trees
    learning_rate=0.05,    # smaller step size
    max_depth=3,           # limit tree depth
    subsample=0.8,         # use 80% of data per tree
    random_state=42
)

# Train model
model.fit(X_train, y_train)

# Predict
train_preds = model.predict(X_train)
val_preds = model.predict(X_val)

# Calculate R2 scores
train_r2 = r2_score(y_train, train_preds)
val_r2 = r2_score(y_val, val_preds)

print(f"Training R2 score: {train_r2:.2f}")
print(f"Validation R2 score: {val_r2:.2f}")
Reduced learning rate from default 0.1 to 0.05 to slow learning and improve generalization.
Increased number of estimators from default 100 to 200 to compensate for smaller learning rate.
Limited max_depth to 3 to prevent overly complex trees.
Added subsample=0.8 to use random 80% of data per tree, adding randomness to reduce overfitting.
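The values above were chosen by hand; a small grid search is one way to arrive at them more systematically. A minimal sketch, assuming a coarse grid over the same regularizing hyperparameters is enough, and again using synthetic data so it runs standalone:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing data (assumption, for illustration only)
X, y = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=42)

# Coarse grid over the hyperparameters adjusted in the solution
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "subsample": [0.8, 1.0],
}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=100, random_state=42),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

Cross-validated search scores each combination on held-out folds, so it selects for generalization rather than training fit.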
Results Interpretation

Before: Training R2 = 0.95, Validation R2 = 0.70 (overfitting)

After: Training R2 = 0.88, Validation R2 = 0.86 (better generalization)

Reducing the learning rate, limiting tree depth, and adding subsampling all help reduce overfitting in gradient boosting, improving validation performance.
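A single 80/20 split can look better or worse than it should by chance, so cross-validation gives a steadier read on generalization. A minimal sketch with the tuned hyperparameters, on synthetic data (an assumption, so the snippet runs standalone):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing data (assumption, for illustration only)
X, y = make_regression(n_samples=500, n_features=8, noise=25.0, random_state=42)

model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=3,
    subsample=0.8, random_state=42,
)

# Five validation R2 scores, one per fold
scores = cross_val_score(model, X, y, scoring="r2", cv=5)
print(f"Mean R2: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

If the five fold scores are close together, the validation estimate is trustworthy; a large spread suggests the single-split numbers above should be read with some caution.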
Bonus Experiment
Try adding early stopping to stop training when validation score stops improving.
💡 Hint
Use the 'validation_fraction' and 'n_iter_no_change' parameters in GradientBoostingRegressor.
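Putting the hint into code: when n_iter_no_change is set, GradientBoostingRegressor carves off an internal validation split of size validation_fraction and stops adding trees once the validation score fails to improve by tol for that many consecutive rounds. A sketch on synthetic data (an assumption, so it runs standalone):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the housing data (assumption, for illustration only)
X, y = make_regression(n_samples=500, n_features=8, noise=25.0, random_state=42)

model = GradientBoostingRegressor(
    n_estimators=500,          # generous upper bound on the number of trees
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    validation_fraction=0.1,   # hold out 10% of training data internally
    n_iter_no_change=10,       # stop after 10 rounds with no improvement
    tol=1e-4,
    random_state=42,
)
model.fit(X, y)

# n_estimators_ is the number of trees actually fitted before stopping
print(f"Fitted {model.n_estimators_} trees (cap was 500)")
```

This lets you set n_estimators high without hand-tuning it: the model stops itself once extra trees no longer help on the internal validation split.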