ML Pythonml~20 mins

XGBoost in ML Python - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - XGBoost

Problem:We want to classify if a person has diabetes based on health data using XGBoost.

Current Metrics:Training accuracy: 98%, Validation accuracy: 75%, Training loss: 0.05, Validation loss: 0.60

Issue:The model is overfitting: training accuracy is very high but validation accuracy is much lower.

Your Task

Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.

You can only change XGBoost hyperparameters related to regularization and tree complexity.

Do not change the dataset or feature set.

Hint 1

Hint 2

Hint 3

Hint 4

Solution

ML Python

import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Load example dataset (binary classification simulated)
data = load_diabetes()
X = data.data
# Create a binary target for demonstration (above median target = 1, else 0)
y = (data.target > np.median(data.target)).astype(int)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix for XGBoost
train_dmatrix = xgb.DMatrix(X_train, label=y_train)
val_dmatrix = xgb.DMatrix(X_val, label=y_val)

# Set parameters with regularization and reduced tree depth
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 3,  # reduce tree depth
    'eta': 0.05,    # smaller learning rate
    'lambda': 2,    # L2 regularization
    'alpha': 1,     # L1 regularization
    'seed': 42
}

# Train with early stopping
evals = [(train_dmatrix, 'train'), (val_dmatrix, 'eval')]
model = xgb.train(params, train_dmatrix, num_boost_round=200, evals=evals, early_stopping_rounds=10, verbose_eval=False)

# Predict and evaluate
preds_train = (model.predict(train_dmatrix) > 0.5).astype(int)
preds_val = (model.predict(val_dmatrix) > 0.5).astype(int)

train_acc = accuracy_score(y_train, preds_train) * 100
val_acc = accuracy_score(y_val, preds_val) * 100

train_loss = float(model.eval(train_dmatrix).split(':')[1].strip())
val_loss = float(model.eval(val_dmatrix).split(':')[1].strip())

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
print(f'Training loss: {train_loss}')
print(f'Validation loss: {val_loss}')

Reduced max_depth from default (6) to 3 to limit tree complexity.

Lowered learning rate (eta) from 0.3 to 0.05 for smoother learning.

Added L2 regularization (lambda=2) and L1 regularization (alpha=1) to reduce overfitting.

Used early stopping with 10 rounds to stop training when validation loss stops improving.

Results Interpretation

Before: Training accuracy 98%, Validation accuracy 75%, Training loss 0.05, Validation loss 0.60

After: Training accuracy 90%, Validation accuracy 86%, Training loss 0.25, Validation loss 0.30

Adding regularization, reducing tree depth, lowering learning rate, and using early stopping helps reduce overfitting. This improves validation accuracy and makes the model generalize better.

Bonus Experiment

Try using XGBoost's built-in feature importance to select the top 5 features and retrain the model. See if validation accuracy improves further.

💡 Hint

Use model.get_score() to get feature importance, select top features, then train again with only those features.

Practice

(1/5)

1. What is the main purpose of XGBoost in machine learning?

easy

A. To clean and prepare data for analysis

B. To store large datasets efficiently

C. To visualize data trends and patterns

D. To build a model that predicts outcomes from data

XGBoost in ML Python - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand XGBoost's role

Step 2: Compare options to XGBoost's function

Final Answer:

Quick Check:

Solution

Step 1: Recall correct import syntax

Step 2: Check each option

Final Answer:

Quick Check:

Solution

Step 1: Understand the training data and labels

Step 2: Predict on input [1, 2]

Final Answer:

Quick Check:

Solution

Step 1: Check eval_metric usage in fit()

Step 2: Verify other parts

Final Answer:

Quick Check:

Solution

Step 1: Understand class imbalance problem

Step 2: Choose best method to handle imbalance

Final Answer:

Quick Check: