How to Use Cross Validation for Tuning in Python with sklearn
Use GridSearchCV or RandomizedSearchCV from sklearn.model_selection to perform cross validation while tuning hyperparameters. These tools split your data into folds, train a model on each candidate parameter combination, and automatically select the parameters with the best validation scores.

Syntax
The main syntax for tuning with cross validation uses GridSearchCV or RandomizedSearchCV from sklearn.model_selection. You provide:
- estimator: the model to tune (like RandomForestClassifier())
- param_grid or param_distributions: dictionary of hyperparameters to try
- cv: number of folds for cross validation
- scoring: metric to evaluate model performance
Then call fit(X, y) to run tuning and cross validation.
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}
model = RandomForestClassifier()
gs = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')
gs.fit(X, y)  # X, y: your feature matrix and target labels
```
Example
This example shows how to use GridSearchCV to tune a Random Forest classifier on the Iris dataset. It searches for the best number of trees and tree depth using 5-fold cross validation and prints the best parameters and accuracy.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)

# Define model and parameters
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 5, 10]}
model = RandomForestClassifier(random_state=42)

# Setup GridSearchCV
gs = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')

# Run tuning
gs.fit(X, y)

# Best parameters and score
print('Best parameters:', gs.best_params_)
print('Best cross-validation accuracy:', gs.best_score_)

# Predict with best model
y_pred = gs.predict(X)
print('Training accuracy with best model:', accuracy_score(y, y_pred))
```
Output
Best parameters: {'max_depth': 5, 'n_estimators': 50}
Best cross-validation accuracy: 0.9666666666666668
Training accuracy with best model: 1.0
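Beyond best_params_, the fitted search also stores the score of every parameter combination in its cv_results_ attribute. A short sketch of inspecting it, continuing the Iris example above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 5, 10]}
gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
gs.fit(X, y)

# cv_results_ holds one entry per parameter combination (3 x 3 = 9 here),
# with the mean validation score across the 5 folds for each
for params, mean in zip(gs.cv_results_['params'],
                        gs.cv_results_['mean_test_score']):
    print(params, round(mean, 3))
```

This is useful for seeing how close the runner-up combinations were, not just which one won.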
Common Pitfalls
Common mistakes when using cross validation for tuning:
- Not splitting data properly before tuning, causing data leakage.
- Using cv=1 or no cross validation, which defeats the purpose.
- Choosing scoring metrics that don't match your problem (e.g., accuracy for imbalanced data).
- Forgetting to set random_state for reproducibility.
- Using overly large parameter grids, causing very long tuning times.
Always use cross validation inside tuning tools like GridSearchCV to avoid overfitting and get reliable performance estimates.
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [10, 50]}
model = RandomForestClassifier(random_state=42)

# Wrong: cv=1 is not valid and will raise an error
# gs_wrong = GridSearchCV(model, param_grid, cv=1)

# Right: use cv=5 for 5-fold cross validation
gs_right = GridSearchCV(model, param_grid, cv=5)
```
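The leakage pitfall above can be avoided by holding out a test set before tuning, so the search never sees it. A minimal sketch, reusing the Iris data from the earlier example:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set BEFORE tuning so it never influences the search
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

param_grid = {'n_estimators': [10, 50], 'max_depth': [None, 5]}
gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
gs.fit(X_train, y_train)  # cross validation happens inside the training split only

# Scoring on the untouched test set gives a less biased performance estimate
print('Test accuracy:', gs.score(X_test, y_test))
```

Contrast this with the main example, which fits on all of X and y and therefore reports an optimistic training accuracy of 1.0.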
Quick Reference
Tips for using cross validation tuning in sklearn:
- Use GridSearchCV for exhaustive search over parameters.
- Use RandomizedSearchCV for faster search with random sampling.
- Set cv to 5 or 10 for a balanced bias-variance tradeoff.
- Choose scoring based on your problem (e.g., 'accuracy', 'roc_auc').
- Access the best model with best_estimator_ after fitting.
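As a sketch of the RandomizedSearchCV tip above: it samples n_iter parameter settings from param_distributions instead of trying every combination, which scales much better to large grids. The scipy distribution used here is an assumption (any list or scipy.stats distribution with an rvs method works):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Distributions (or plain lists) to sample from, rather than an exhaustive grid
param_distributions = {
    'n_estimators': randint(10, 200),  # any integer in [10, 200)
    'max_depth': [None, 5, 10, 20],
}

rs = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,        # number of sampled parameter settings
    cv=5,
    scoring='accuracy',
    random_state=42,  # makes the sampling reproducible
)
rs.fit(X, y)
print('Best parameters:', rs.best_params_)
print('Best cross-validation accuracy:', rs.best_score_)
```

With n_iter=10 this fits 10 x 5 = 50 models, regardless of how large the search space is.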
Key Takeaways
Use GridSearchCV or RandomizedSearchCV with the cv parameter to tune models with cross validation.
Cross validation splits data into folds to evaluate model performance reliably during tuning.
Choose appropriate scoring metrics and parameter grids to get meaningful tuning results.
Avoid data leakage by fitting tuning only on training data with cross validation.
Set random_state for reproducible tuning results.