How to Use cross_val_score in sklearn with Python
Use cross_val_score from sklearn.model_selection to evaluate a model by splitting the data into folds and scoring each one. Pass your model, data, target, and scoring method to get an array of scores showing model performance across folds.
Syntax
The basic syntax of cross_val_score is:

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator, X, y, cv=5, scoring='accuracy')
```

- estimator: The model you want to evaluate (e.g., a classifier or regressor).
- X: Your input features (data).
- y: The target labels or values.
- cv: Number of folds or cross-validation splitting strategy.
- scoring: Metric to evaluate the model (e.g., 'accuracy', 'neg_mean_squared_error').
- n_jobs: Number of CPU cores to use (optional).

The function returns an array of scores, one for each fold.
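The cv argument also accepts a splitter object instead of an integer, which gives explicit control over shuffling and the random seed. A minimal sketch using KFold (the model and seed here are illustrative choices, not part of the API):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Passing a KFold splitter instead of cv=5 lets you control
# shuffling and reproducibility via random_state
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print('Number of folds scored:', len(scores))
```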
Example
This example shows how to use cross_val_score with a logistic regression model on the Iris dataset. It evaluates accuracy using 5-fold cross-validation.
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Create model
model = LogisticRegression(max_iter=200)

# Evaluate model with 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print('Accuracy scores for each fold:', scores)
print('Mean accuracy:', scores.mean())
```
Output
Accuracy scores for each fold: [1. 0.97 0.97 0.97 1. ]
Mean accuracy: 0.982
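Per-fold scores vary, so it is common to report the standard deviation alongside the mean; a large spread suggests the estimate is unstable. A small sketch of that reporting style:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Report the spread as well as the average: a large std
# hints at unstable performance across folds
print(f'Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```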
Common Pitfalls
Common mistakes when using cross_val_score include:

- Not setting cv properly, which can lead to too few or too many folds.
- Using the wrong scoring metric for your problem type (classification vs. regression).
- Calling fit on the model yourself before cross-validation; cross_val_score fits a fresh copy of the model on each fold automatically.
- Passing data that is not shuffled or stratified when needed, causing biased results.
Always check that your scoring metric matches your task and that your data is prepared correctly.
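For the last pitfall, classification data stored in class order (as Iris is) benefits from a stratified, shuffled splitter. A hedged sketch using StratifiedKFold, which keeps class proportions roughly equal in every fold:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)  # samples are ordered by class
model = LogisticRegression(max_iter=200)

# StratifiedKFold preserves class proportions in every fold;
# shuffle=True breaks up the class-ordered rows
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print('Stratified accuracy per fold:', scores)
```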
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target
model = LinearRegression()

# Wrong: using accuracy scoring for a regression model
# scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')  # This will error

# Right: use a regression metric like neg_mean_squared_error
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
print('MSE scores:', scores)
```
Output
MSE scores: [-0.123 -0.145 -0.130 -0.140 -0.135]
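The neg_ prefix exists because scikit-learn's scoring convention is that higher is always better, so error metrics are negated. To report an ordinary positive error, negate the scores yourself; a short sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_squared_error')

# Negate to recover conventional (positive) MSE,
# then take the square root for RMSE
mse = -scores
rmse = np.sqrt(mse)
print('Mean MSE: %.3f, Mean RMSE: %.3f' % (mse.mean(), rmse.mean()))
```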
Quick Reference
| Parameter | Description |
|---|---|
| estimator | Model object implementing fit and predict |
| X | Feature data (array-like) |
| y | Target labels or values |
| cv | Number of folds or cross-validation splitter (default=5) |
| scoring | Metric string or callable to evaluate model |
| n_jobs | Number of CPU cores to use (-1 for all cores) |
| verbose | Controls verbosity of output (0=no output) |
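The parameters above combine straightforwardly. A sketch using n_jobs=-1 to score folds in parallel and cv=10 for ten folds (the RandomForestClassifier and its settings are illustrative choices, not requirements):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# n_jobs=-1 distributes the folds across all available CPU cores;
# cv=10 requests ten folds instead of the default five
scores = cross_val_score(model, X, y, cv=10, scoring='accuracy', n_jobs=-1)
print('Folds scored:', len(scores))
```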
Key Takeaways
Use cross_val_score to easily evaluate model performance with cross-validation.
Pass your model, data, target, and scoring metric to get scores for each fold.
Choose the right scoring metric matching your task (classification or regression).
Set cv to control how many folds to split your data into for validation.
Check scores.mean() to get an overall performance estimate.