How to Tune XGBoost Hyperparameters in Python with sklearn
To tune XGBoost hyperparameters in Python, use GridSearchCV or RandomizedSearchCV from sklearn.model_selection with an xgboost.XGBClassifier or XGBRegressor. Define a parameter grid with key hyperparameters such as max_depth, learning_rate, and n_estimators, then fit the search to find the best combination.

Syntax
Use GridSearchCV or RandomizedSearchCV to search over hyperparameter combinations. Pass the XGBoost model and a dictionary of parameters to try. Fit the search object on training data to find the best parameters.
- estimator: The XGBoost model (XGBClassifier or XGBRegressor).
- param_grid: Dictionary of hyperparameters and the values to test.
- cv: Number of cross-validation folds.
- scoring: Metric used to evaluate model performance.
- fit: Runs the search on the training data.
```python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200]
}

model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='accuracy')
```
Example
This example shows how to tune max_depth, learning_rate, and n_estimators for an XGBoost classifier on the Iris dataset using GridSearchCV. It prints the best parameters and the best accuracy score.
```python
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [50, 100]
}

# Initialize model
model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='accuracy')

# Fit grid search
grid_search.fit(X_train, y_train)

# Best parameters and score
print('Best parameters:', grid_search.best_params_)
print('Best cross-validation accuracy:', grid_search.best_score_)

# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print('Test set accuracy:', accuracy_score(y_test, y_pred))
```
Output
```text
Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}
Best cross-validation accuracy: 0.9583333333333334
Test set accuracy: 1.0
```
Common Pitfalls
Common mistakes when tuning XGBoost hyperparameters include:
- Not setting use_label_encoder=False and eval_metric in XGBClassifier, which causes warnings on older XGBoost releases. (XGBoost 1.6+ removed the built-in label encoder, so the flag is only needed on older versions.)
- Using overly large parameter grids, leading to very long search times.
- Ignoring cross-validation and overfitting to training data.
- Not scaling or preprocessing data when needed.
Always start with a small grid and increase complexity gradually.
```python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Wrong: missing eval_metric and use_label_encoder (warns on older XGBoost)
model_wrong = XGBClassifier()

# Right: set parameters to avoid warnings
model_right = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

param_grid = {'max_depth': [3, 5]}
grid_search = GridSearchCV(model_right, param_grid, cv=3)
# Fit grid_search on your data as usual
```
Quick Reference
| Hyperparameter | Description | Typical Values |
|---|---|---|
| max_depth | Maximum tree depth to control complexity | 3 to 10 |
| learning_rate | Step size shrinkage to prevent overfitting | 0.01 to 0.3 |
| n_estimators | Number of trees to build | 50 to 500 |
| subsample | Fraction of samples used per tree | 0.5 to 1.0 |
| colsample_bytree | Fraction of features used per tree | 0.5 to 1.0 |
| gamma | Minimum loss reduction to make a split | 0 to 5 |
Key Takeaways
- Use GridSearchCV or RandomizedSearchCV with XGBClassifier or XGBRegressor to tune hyperparameters.
- Start with key parameters such as max_depth, learning_rate, and n_estimators for the best results.
- Set use_label_encoder=False and eval_metric='logloss' to avoid warnings on older XGBoost versions.
- Keep parameter grids small initially to reduce computation time and avoid overfitting the search.
- Evaluate tuned models with cross-validation and a held-out test set for reliable performance estimates.