How to Use Cross Validation in sklearn with Python
Use cross_val_score from sklearn.model_selection to perform cross validation in Python. It splits your data into folds, trains the model on some folds, tests on the held-out fold, and returns a score for each split.
Syntax
The main function for cross validation in sklearn is cross_val_score. It takes a model, data, target labels, and the number of folds to split the data into.
- estimator: The machine learning model you want to evaluate.
- X: Your input features (data).
- y: The target labels (what you want to predict).
- cv: Number of folds (default is 5).
- scoring: Metric to evaluate (e.g., 'accuracy').
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator, X, y, cv=5, scoring='accuracy')
```
Example
This example shows how to use cross validation with a logistic regression model on the iris dataset. It prints accuracy scores for each fold and the average accuracy.
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Create model
model = LogisticRegression(max_iter=200, random_state=42)

# Perform 5-fold cross validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Print scores
print('Accuracy scores for each fold:', scores)
print('Average accuracy:', scores.mean())
```
Output
Accuracy scores for each fold: [1. 0.96666667 0.9 0.96666667 1. ]
Average accuracy: 0.9666666666666668
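If you want more than one metric per fold, cross_validate (also in sklearn.model_selection) returns a dict of arrays, including fit and score times. A minimal sketch on the same iris setup, using accuracy and macro-averaged F1 as example metrics:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

iris = load_iris()
X, y = iris.data, iris.target
model = LogisticRegression(max_iter=200, random_state=42)

# cross_validate returns a dict with one 'test_<metric>' entry per metric,
# plus 'fit_time' and 'score_time' arrays
results = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'f1_macro'])
print('Accuracy per fold:', results['test_accuracy'])
print('Macro F1 per fold:', results['test_f1_macro'])
```

This is handy when a single scoring string is not enough to judge the model.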
Common Pitfalls
Common mistakes when using cross validation include:
- Not setting random_state in models that use randomness, leading to different results each run.
- Using cross validation on data that is not shuffled or stratified, which can cause biased splits.
- Mixing training and test data before splitting, causing data leakage.
- Not choosing the right scoring metric for your problem.
Always use stratified splits for classification to keep class proportions balanced. Note that cross_val_score with an integer cv already uses StratifiedKFold (without shuffling) when the estimator is a classifier; passing a StratifiedKFold instance explicitly makes this visible and lets you configure options such as shuffling.
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

# With an integer cv, cross_val_score already uses stratified folds
# for classifiers (StratifiedKFold without shuffling)
scores_default = cross_val_score(LogisticRegression(max_iter=200, random_state=42), X, y, cv=5)

# Passing StratifiedKFold explicitly makes the stratification visible
# and lets you enable shuffling via shuffle=True if needed
skf = StratifiedKFold(n_splits=5)
scores_explicit = cross_val_score(LogisticRegression(max_iter=200, random_state=42), X, y, cv=skf)

print('Scores with default folds:', scores_default)
print('Scores with explicit StratifiedKFold:', scores_explicit)
```
Output
Scores with default folds: [1. 0.96666667 0.9 0.96666667 1. ]
Scores with explicit StratifiedKFold: [1. 0.96666667 0.9 0.96666667 1. ]
The two runs match because the integer cv=5 already used the same stratified folds.
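One way to avoid the data leakage pitfall listed above is to put preprocessing inside a Pipeline, so each fold fits its transformers only on that fold's training portion. A sketch assuming a StandardScaler step as the preprocessing:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# The pipeline refits the scaler on each training fold, so the test fold
# never influences the scaling statistics
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200, random_state=42))
scores = cross_val_score(pipe, X, y, cv=5)
print('Leak-free scores:', scores)
```

Scaling the full dataset before calling cross_val_score would let test-fold statistics leak into training; the pipeline keeps each fold honest.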
Quick Reference
| Parameter | Description | Example |
|---|---|---|
| estimator | Model to evaluate | LogisticRegression() |
| X | Input features | iris.data |
| y | Target labels | iris.target |
| cv | Number of folds or splitter | 5 or StratifiedKFold() |
| scoring | Metric to evaluate | 'accuracy', 'roc_auc' |
Key Takeaways
- Use cross_val_score from sklearn.model_selection to easily perform cross validation.
- Set cv to control how many folds your data is split into, commonly 5 or 10.
- Use stratified splits for classification to keep class proportions balanced.
- Choose the right scoring metric to match your problem goals.
- Avoid data leakage by not mixing training and test data before splitting.
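To see why the scoring choice matters, here is a sketch comparing several metrics on a binary classification task (the breast cancer dataset is used purely as an illustration, since 'roc_auc' and 'f1' require binary or properly configured targets):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary task, so 'roc_auc' and 'f1' are valid scorer names here
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200, random_state=42))

# The same model can rank differently under different metrics
for metric in ['accuracy', 'roc_auc', 'f1']:
    scores = cross_val_score(model, X, y, cv=5, scoring=metric)
    print(f'{metric}: {scores.mean():.3f}')
```

For imbalanced classes, accuracy alone can be misleading; metrics such as 'roc_auc' or 'f1' usually give a fuller picture.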