How to Tune XGBoost Hyperparameters in Python with sklearn
To tune XGBoost hyperparameters in Python, use GridSearchCV or RandomizedSearchCV from sklearn.model_selection with an xgboost.XGBClassifier or XGBRegressor. Define a parameter grid with key hyperparameters such as max_depth, learning_rate, and n_estimators, then fit the search to find the best combination.

Syntax
Use GridSearchCV or RandomizedSearchCV to search over hyperparameter combinations. Pass the XGBoost model and a dictionary of parameters to try. Fit the search object on training data to find the best parameters.
- estimator: The XGBoost model (XGBClassifier or XGBRegressor).
- param_grid: Dictionary of hyperparameters and the values to test.
- cv: Number of cross-validation folds.
- scoring: Metric used to evaluate model performance.
- fit: Runs the search on the training data.
```python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200]
}

model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='accuracy')
```
Example
This example shows how to tune max_depth, learning_rate, and n_estimators for an XGBoost classifier on the Iris dataset using GridSearchCV. It prints the best parameters and the best accuracy score.
```python
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [50, 100]
}

# Initialize model
model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='accuracy')

# Fit grid search
grid_search.fit(X_train, y_train)

# Best parameters and score
print('Best parameters:', grid_search.best_params_)
print('Best cross-validation accuracy:', grid_search.best_score_)

# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print('Test set accuracy:', accuracy_score(y_test, y_pred))
```
Output
```text
Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}
Best cross-validation accuracy: 0.9583333333333334
Test set accuracy: 1.0
```
Common Pitfalls
Common mistakes when tuning XGBoost hyperparameters include:
- Not setting use_label_encoder=False and eval_metric in XGBClassifier, which causes warnings on older XGBoost releases. (XGBoost 1.6+ removed the built-in label encoder, so the flag is only needed on older versions.)
- Using overly large parameter grids, leading to very long search times.
- Ignoring cross-validation and overfitting to training data.
- Not scaling or preprocessing data when needed.
Always start with a small grid and increase complexity gradually.
```python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Wrong: missing eval_metric and use_label_encoder (warns on older XGBoost)
model_wrong = XGBClassifier()

# Right: set parameters to avoid warnings
model_right = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

param_grid = {'max_depth': [3, 5]}
grid_search = GridSearchCV(model_right, param_grid, cv=3)
# Fit grid_search on your data as usual
```
Quick Reference
| Hyperparameter | Description | Typical Values |
|---|---|---|
| max_depth | Maximum tree depth to control complexity | 3 to 10 |
| learning_rate | Step size shrinkage to prevent overfitting | 0.01 to 0.3 |
| n_estimators | Number of trees to build | 50 to 500 |
| subsample | Fraction of samples used per tree | 0.5 to 1.0 |
| colsample_bytree | Fraction of features used per tree | 0.5 to 1.0 |
| gamma | Minimum loss reduction to make a split | 0 to 5 |
Key Takeaways
- Use GridSearchCV or RandomizedSearchCV with XGBClassifier or XGBRegressor to tune hyperparameters.
- Start with key parameters such as max_depth, learning_rate, and n_estimators for the best results.
- Set use_label_encoder=False and eval_metric='logloss' to avoid warnings on older XGBoost versions.
- Keep parameter grids small initially to reduce computation time and avoid overfitting the search.
- Evaluate tuned models with cross-validation and a held-out test set for reliable performance estimates.