
How to Tune XGBoost Hyperparameters in Python with sklearn

To tune XGBoost hyperparameters in Python, use GridSearchCV or RandomizedSearchCV from sklearn.model_selection with an xgboost.XGBClassifier or XGBRegressor. Define a parameter grid with key hyperparameters like max_depth, learning_rate, and n_estimators, then fit the search to find the best combination.

Syntax

Use GridSearchCV or RandomizedSearchCV to search over hyperparameter combinations. Pass the XGBoost model and a dictionary of parameters to try. Fit the search object on training data to find the best parameters.

  • estimator: The XGBoost model (XGBClassifier or XGBRegressor).
  • param_grid: Dictionary of hyperparameters and their values to test.
  • cv: Number of cross-validation folds.
  • scoring: Metric to evaluate model performance.
  • fit(X, y): Method that runs the search on the training data.
python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200]
}

model = XGBClassifier(eval_metric='logloss')  # on XGBoost 1.x, also pass use_label_encoder=False
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='accuracy')

Example

This example shows how to tune max_depth, learning_rate, and n_estimators for an XGBoost classifier on the Iris dataset using GridSearchCV. It prints the best parameters and the best accuracy score.

python
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [50, 100]
}

# Initialize model
model = XGBClassifier(eval_metric='logloss')  # on XGBoost 1.x, also pass use_label_encoder=False

# Setup GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='accuracy')

# Fit grid search
grid_search.fit(X_train, y_train)

# Best parameters and score
print('Best parameters:', grid_search.best_params_)
print('Best cross-validation accuracy:', grid_search.best_score_)

# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print('Test set accuracy:', accuracy_score(y_test, y_pred))
Output
Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}
Best cross-validation accuracy: 0.9583333333333334
Test set accuracy: 1.0

Common Pitfalls

Common mistakes when tuning XGBoost hyperparameters include:

  • Passing deprecated parameters: use_label_encoder=False silenced a warning on XGBoost 1.x, but the parameter was removed in XGBoost 2.0, where passing it triggers a warning itself. Set eval_metric explicitly in either case.
  • Using too large parameter grids, leading to very long search times.
  • Ignoring cross-validation and overfitting to training data.
  • Not scaling or preprocessing data when needed.

Always start with a small grid and increase complexity gradually.

python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Noisy on XGBoost 1.x: the default label encoder and missing eval_metric trigger warnings
model_wrong = XGBClassifier()

# Quiet: set eval_metric explicitly (add use_label_encoder=False only on XGBoost 1.x;
# the parameter was removed in 2.0 and now itself triggers a warning)
model_right = XGBClassifier(eval_metric='logloss')

param_grid = {'max_depth': [3, 5]}
grid_search = GridSearchCV(model_right, param_grid, cv=3)
# Fit grid_search on your data as usual

Quick Reference

| Hyperparameter | Description | Typical Values |
| --- | --- | --- |
| max_depth | Maximum tree depth to control complexity | 3 to 10 |
| learning_rate | Step-size shrinkage to prevent overfitting | 0.01 to 0.3 |
| n_estimators | Number of trees to build | 50 to 500 |
| subsample | Fraction of samples used per tree | 0.5 to 1.0 |
| colsample_bytree | Fraction of features used per tree | 0.5 to 1.0 |
| gamma | Minimum loss reduction to make a split | 0 to 5 |

Key Takeaways

  • Use GridSearchCV or RandomizedSearchCV with XGBClassifier or XGBRegressor to tune hyperparameters.
  • Start with key parameters such as max_depth, learning_rate, and n_estimators.
  • Set eval_metric explicitly (and use_label_encoder=False on XGBoost 1.x only) to avoid warnings in XGBClassifier.
  • Keep parameter grids small at first to limit computation time.
  • Evaluate tuned models with cross-validation and a held-out test set for reliable performance estimates.