How to Use Cross Validation for Tuning in Python with sklearn
Use GridSearchCV or RandomizedSearchCV from sklearn.model_selection to perform cross validation while tuning hyperparameters. These tools split your data into folds, train a model on each candidate parameter combination, and automatically select the parameters with the best validation scores.

Syntax
The main syntax for tuning with cross validation uses GridSearchCV or RandomizedSearchCV from sklearn.model_selection. You provide:
- estimator: the model to tune (like RandomForestClassifier())
- param_grid or param_distributions: dictionary of hyperparameters to try
- cv: number of folds for cross validation
- scoring: metric to evaluate model performance
Then call fit(X, y) to run tuning and cross validation.
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}
model = RandomForestClassifier()
gs = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')
gs.fit(X, y)  # X, y: your feature matrix and target labels
```
Example
This example shows how to use GridSearchCV to tune a Random Forest classifier on the Iris dataset. It searches for the best number of trees and tree depth using 5-fold cross validation and prints the best parameters and accuracy.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)

# Define model and parameters
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 5, 10]}
model = RandomForestClassifier(random_state=42)

# Setup GridSearchCV
gs = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')

# Run tuning
gs.fit(X, y)

# Best parameters and score
print('Best parameters:', gs.best_params_)
print('Best cross-validation accuracy:', gs.best_score_)

# Predict with best model
y_pred = gs.predict(X)
print('Training accuracy with best model:', accuracy_score(y, y_pred))
```
Output
Best parameters: {'max_depth': 5, 'n_estimators': 50}
Best cross-validation accuracy: 0.9666666666666668
Training accuracy with best model: 1.0
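Beyond best_params_, the fitted search also stores the score of every parameter combination in its cv_results_ attribute. A short sketch of inspecting it, continuing the Iris example above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 5, 10]}
gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
gs.fit(X, y)

# cv_results_ holds one entry per parameter combination (3 x 3 = 9 here),
# with the mean validation score across the 5 folds for each
for params, mean in zip(gs.cv_results_['params'],
                        gs.cv_results_['mean_test_score']):
    print(params, round(mean, 3))
```

This is useful for seeing how close the runner-up combinations were, not just which one won.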
Common Pitfalls
Common mistakes when using cross validation for tuning:
- Not splitting data properly before tuning, causing data leakage.
- Using cv=1 or no cross validation, which defeats the purpose.
- Choosing scoring metrics that don't match your problem (e.g., accuracy for imbalanced data).
- Forgetting to set random_state for reproducibility.
- Using overly large parameter grids, causing very long tuning times.
Always use cross validation inside tuning tools like GridSearchCV to avoid overfitting and get reliable performance estimates.
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [10, 50]}
model = RandomForestClassifier(random_state=42)

# Wrong: cv=1 is not valid and will raise an error
# gs_wrong = GridSearchCV(model, param_grid, cv=1)

# Right: use cv=5 for 5-fold cross validation
gs_right = GridSearchCV(model, param_grid, cv=5)
```
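The leakage pitfall above can be avoided by holding out a test set before tuning, so the search never sees it. A minimal sketch, reusing the Iris data from the earlier example:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set BEFORE tuning so it never influences the search
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

param_grid = {'n_estimators': [10, 50], 'max_depth': [None, 5]}
gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
gs.fit(X_train, y_train)  # cross validation happens inside the training split only

# Scoring on the untouched test set gives a less biased performance estimate
print('Test accuracy:', gs.score(X_test, y_test))
```

Contrast this with the main example, which fits on all of X and y and therefore reports an optimistic training accuracy of 1.0.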
Quick Reference
Tips for using cross validation tuning in sklearn:
- Use GridSearchCV for exhaustive search over parameters.
- Use RandomizedSearchCV for faster search with random sampling.
- Set cv to 5 or 10 for a balanced bias-variance tradeoff.
- Choose scoring based on your problem (e.g., 'accuracy', 'roc_auc').
- Access the best model with best_estimator_ after fitting.
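As a sketch of the RandomizedSearchCV tip above: it samples n_iter parameter settings from param_distributions instead of trying every combination, which scales much better to large grids. The scipy distribution used here is an assumption (any list or scipy.stats distribution with an rvs method works):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Distributions (or plain lists) to sample from, rather than an exhaustive grid
param_distributions = {
    'n_estimators': randint(10, 200),  # any integer in [10, 200)
    'max_depth': [None, 5, 10, 20],
}

rs = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,        # number of sampled parameter settings
    cv=5,
    scoring='accuracy',
    random_state=42,  # makes the sampling reproducible
)
rs.fit(X, y)
print('Best parameters:', rs.best_params_)
print('Best cross-validation accuracy:', rs.best_score_)
```

With n_iter=10 this fits 10 x 5 = 50 models, regardless of how large the search space is.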
Key Takeaways
Use GridSearchCV or RandomizedSearchCV with the cv parameter to tune models with cross validation.
Cross validation splits data into folds to evaluate model performance reliably during tuning.
Choose appropriate scoring metrics and parameter grids to get meaningful tuning results.
Avoid data leakage by fitting tuning only on training data with cross validation.
Set random_state for reproducible tuning results.