How to Tune Decision Tree Hyperparameters in Python with sklearn
Use GridSearchCV from sklearn.model_selection with a parameter grid that includes options such as max_depth, min_samples_split, and criterion. GridSearchCV finds the best settings by testing every combination and evaluating model accuracy with cross-validation.

Syntax
Use GridSearchCV to search over hyperparameter combinations for DecisionTreeClassifier. Define a parameter grid with keys as hyperparameter names and values as lists of options to try.
- DecisionTreeClassifier(): The decision tree model.
- param_grid: Dictionary of hyperparameters to tune.
- GridSearchCV(estimator, param_grid, cv): Runs cross-validation over the grid to find the best parameters.
- fit(X, y): Trains the model on the data.
- best_params_: The best hyperparameters found.
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

clf = DecisionTreeClassifier()
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```
Example
This example shows how to tune a decision tree classifier on the iris dataset using GridSearchCV. It prints the best hyperparameters and the accuracy on test data.
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4],
    'criterion': ['gini', 'entropy']
}

# Create model and grid search
clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=4)
grid_search.fit(X_train, y_train)

# Best parameters
print('Best hyperparameters:', grid_search.best_params_)

# Predict and evaluate
y_pred = grid_search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy on test set: {accuracy:.3f}')
```
Common Pitfalls
Overfitting: Setting max_depth too high can make the tree memorize training data and perform poorly on new data.
Underfitting: Setting max_depth too low or min_samples_split too high can make the tree too simple to capture patterns.
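The effect of both pitfalls can be seen by comparing cross-validated accuracy at different depths. A minimal sketch on the iris dataset (the depth values 1, 3, and None are illustrative choices, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare a very shallow, a moderate, and an unconstrained tree
results = {}
for depth in [1, 3, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Mean accuracy over 5 cross-validation folds on the training split
    results[depth] = cross_val_score(clf, X_train, y_train, cv=5).mean()
    print(f'max_depth={depth}: mean CV accuracy = {results[depth]:.3f}')
```

The depth-1 stump typically scores well below the deeper trees here, which is the underfitting case; on noisier datasets the unconstrained tree would in turn lag a moderately pruned one.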
Ignoring cross-validation: Not using cross-validation can lead to choosing hyperparameters that work only on training data.
Unnecessary scaling: Decision trees split on one feature at a time using thresholds, so they do not require feature scaling; adding a scaler complicates the pipeline without changing the result.
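A quick sketch illustrating why scaling is unnecessary for trees: fitting the same tree on raw and standardized iris features yields (up to floating-point tie-breaking) the same test accuracy, since standardization is a monotone transform of each feature:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Same tree, raw features
clf_raw = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
acc_raw = clf_raw.score(X_test, y_test)

# Same tree, standardized features
scaler = StandardScaler().fit(X_train)
clf_scaled = DecisionTreeClassifier(random_state=42).fit(scaler.transform(X_train), y_train)
acc_scaled = clf_scaled.score(scaler.transform(X_test), y_test)

print('raw:   ', acc_raw)
print('scaled:', acc_scaled)
```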
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Wrong: no cross-validation, just the training score
clf = DecisionTreeClassifier(max_depth=10)
clf.fit(X_train, y_train)
print('Training accuracy:', clf.score(X_train, y_train))

# Right: use GridSearchCV with cross-validation
param_grid = {'max_depth': [2, 5, 10]}
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print('Best max_depth:', grid_search.best_params_['max_depth'])
```
Quick Reference
Here are key hyperparameters to tune in DecisionTreeClassifier:
| Hyperparameter | Description | Typical Values |
|---|---|---|
| max_depth | Maximum depth of the tree | [None, 3, 5, 10] |
| min_samples_split | Minimum samples to split a node | [2, 5, 10] |
| min_samples_leaf | Minimum samples at a leaf node | [1, 2, 4] |
| criterion | Function to measure quality of split | ['gini', 'entropy'] |
| max_features | Number of features to consider at split | [None, 'sqrt', 'log2'] |
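The table above can be folded into a single search. A minimal sketch on the iris dataset (the grid values mirror the "Typical Values" column and are illustrative, not prescriptive; a grid this size fits 216 combinations, so expect it to be slower on larger datasets):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One grid covering every hyperparameter in the quick-reference table
param_grid = {
    'max_depth': [None, 3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy'],
    'max_features': [None, 'sqrt', 'log2'],
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    n_jobs=-1,  # use all cores; each fold/combination fits independently
)
grid_search.fit(X, y)

print('Best params:', grid_search.best_params_)
print(f'Best CV accuracy: {grid_search.best_score_:.3f}')
```

For much larger grids, RandomizedSearchCV samples a fixed number of combinations instead of trying all of them.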