How to Use cross_val_score in sklearn with Python
Use cross_val_score from sklearn.model_selection to evaluate a model by splitting the data into folds and scoring each one. Pass your model, data, target, and scoring method to get an array of scores showing model performance across folds.
Syntax
The basic syntax of cross_val_score is:

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator, X, y, cv=5, scoring='accuracy')
```

- estimator: The model you want to evaluate (e.g., a classifier or regressor).
- X: Your input features (data).
- y: The target labels or values.
- cv: Number of folds or cross-validation splitting strategy.
- scoring: Metric to evaluate the model (e.g., 'accuracy', 'neg_mean_squared_error').
- n_jobs: Number of CPU cores to use (optional).

The function returns an array of scores, one for each fold.
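The cv argument also accepts a splitter object instead of an integer, which gives explicit control over shuffling and the random seed. A minimal sketch using KFold (the model and seed here are illustrative choices, not part of the API):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Passing a KFold splitter instead of cv=5 lets you control
# shuffling and reproducibility via random_state
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print('Number of folds scored:', len(scores))
```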
Example
This example shows how to use cross_val_score with a logistic regression model on the Iris dataset. It evaluates accuracy using 5-fold cross-validation.
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Create model
model = LogisticRegression(max_iter=200)

# Evaluate model with 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print('Accuracy scores for each fold:', scores)
print('Mean accuracy:', scores.mean())
```
Output
Accuracy scores for each fold: [1. 0.97 0.97 0.97 1. ]
Mean accuracy: 0.982
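Per-fold scores vary, so it is common to report the standard deviation alongside the mean; a large spread suggests the estimate is unstable. A small sketch of that reporting style:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Report the spread as well as the average: a large std
# hints at unstable performance across folds
print(f'Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```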
Common Pitfalls
Common mistakes when using cross_val_score include:

- Not setting cv properly, which can lead to too few or too many folds.
- Using the wrong scoring metric for your problem type (classification vs. regression).
- Calling fit on the model yourself before cross-validation; cross_val_score fits a fresh copy of the model on each fold automatically.
- Passing data that is not shuffled or stratified when needed, causing biased results.
Always check that your scoring metric matches your task and that your data is prepared correctly.
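For the last pitfall, classification data stored in class order (as Iris is) benefits from a stratified, shuffled splitter. A hedged sketch using StratifiedKFold, which keeps class proportions roughly equal in every fold:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)  # samples are ordered by class
model = LogisticRegression(max_iter=200)

# StratifiedKFold preserves class proportions in every fold;
# shuffle=True breaks up the class-ordered rows
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print('Stratified accuracy per fold:', scores)
```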
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target
model = LinearRegression()

# Wrong: using accuracy scoring for a regression model
# scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')  # This will error

# Right: use a regression metric like neg_mean_squared_error
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
print('MSE scores:', scores)
```
Output
MSE scores: [-0.123 -0.145 -0.130 -0.140 -0.135]
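The neg_ prefix exists because scikit-learn's scoring convention is that higher is always better, so error metrics are negated. To report an ordinary positive error, negate the scores yourself; a short sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_squared_error')

# Negate to recover conventional (positive) MSE,
# then take the square root for RMSE
mse = -scores
rmse = np.sqrt(mse)
print('Mean MSE: %.3f, Mean RMSE: %.3f' % (mse.mean(), rmse.mean()))
```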
Quick Reference
| Parameter | Description |
|---|---|
| estimator | Model object implementing fit and predict |
| X | Feature data (array-like) |
| y | Target labels or values |
| cv | Number of folds or cross-validation splitter (default=5) |
| scoring | Metric string or callable to evaluate model |
| n_jobs | Number of CPU cores to use (-1 for all cores) |
| verbose | Controls verbosity of output (0=no output) |
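The parameters above combine straightforwardly. A sketch using n_jobs=-1 to score folds in parallel and cv=10 for ten folds (the RandomForestClassifier and its settings are illustrative choices, not requirements):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# n_jobs=-1 distributes the folds across all available CPU cores;
# cv=10 requests ten folds instead of the default five
scores = cross_val_score(model, X, y, cv=10, scoring='accuracy', n_jobs=-1)
print('Folds scored:', len(scores))
```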
Key Takeaways
Use cross_val_score to easily evaluate model performance with cross-validation.
Pass your model, data, target, and scoring metric to get scores for each fold.
Choose the right scoring metric matching your task (classification or regression).
Set cv to control how many folds to split your data into for validation.
Check scores.mean() to get an overall performance estimate.