How to Use Cross Validation in sklearn with Python
Use cross_val_score from sklearn.model_selection to perform cross validation in Python. It splits your data into folds, trains the model on some folds, tests on the held-out fold, and returns a score for each split.
Syntax
The main function for cross validation in sklearn is cross_val_score. It takes a model, data, target labels, and the number of folds to split the data into.
- estimator: The machine learning model you want to evaluate.
- X: Your input features (data).
- y: The target labels (what you want to predict).
- cv: Number of folds (default is 5).
- scoring: Metric to evaluate (e.g., 'accuracy').
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator, X, y, cv=5, scoring='accuracy')
```
Example
This example shows how to use cross validation with a logistic regression model on the iris dataset. It prints accuracy scores for each fold and the average accuracy.
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Create model
model = LogisticRegression(max_iter=200, random_state=42)

# Perform 5-fold cross validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Print scores
print('Accuracy scores for each fold:', scores)
print('Average accuracy:', scores.mean())
```
Output
Accuracy scores for each fold: [1. 0.96666667 0.9 0.96666667 1. ]
Average accuracy: 0.9666666666666668
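If you want more than one metric per fold, cross_validate (also in sklearn.model_selection) returns a dict of arrays, including fit and score times. A minimal sketch on the same iris setup, using accuracy and macro-averaged F1 as example metrics:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

iris = load_iris()
X, y = iris.data, iris.target
model = LogisticRegression(max_iter=200, random_state=42)

# cross_validate returns a dict with one 'test_<metric>' entry per metric,
# plus 'fit_time' and 'score_time' arrays
results = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'f1_macro'])
print('Accuracy per fold:', results['test_accuracy'])
print('Macro F1 per fold:', results['test_f1_macro'])
```

This is handy when a single scoring string is not enough to judge the model.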
Common Pitfalls
Common mistakes when using cross validation include:
- Not setting random_state in models that use randomness, leading to different results each run.
- Using cross validation on data that is not shuffled or stratified, which can cause biased splits.
- Mixing training and test data before splitting, causing data leakage.
- Not choosing the right scoring metric for your problem.
Always use stratified splits for classification to keep class proportions balanced. Note that cross_val_score with an integer cv already uses StratifiedKFold (without shuffling) when the estimator is a classifier; passing a StratifiedKFold instance explicitly makes this visible and lets you configure options such as shuffling.
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

# With an integer cv, cross_val_score already uses stratified folds
# for classifiers (StratifiedKFold without shuffling)
scores_default = cross_val_score(LogisticRegression(max_iter=200, random_state=42), X, y, cv=5)

# Passing StratifiedKFold explicitly makes the stratification visible
# and lets you enable shuffling via shuffle=True if needed
skf = StratifiedKFold(n_splits=5)
scores_explicit = cross_val_score(LogisticRegression(max_iter=200, random_state=42), X, y, cv=skf)

print('Scores with default folds:', scores_default)
print('Scores with explicit StratifiedKFold:', scores_explicit)
```
Output
Scores with default folds: [1. 0.96666667 0.9 0.96666667 1. ]
Scores with explicit StratifiedKFold: [1. 0.96666667 0.9 0.96666667 1. ]
The two runs match because the integer cv=5 already used the same stratified folds.
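One way to avoid the data leakage pitfall listed above is to put preprocessing inside a Pipeline, so each fold fits its transformers only on that fold's training portion. A sketch assuming a StandardScaler step as the preprocessing:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# The pipeline refits the scaler on each training fold, so the test fold
# never influences the scaling statistics
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200, random_state=42))
scores = cross_val_score(pipe, X, y, cv=5)
print('Leak-free scores:', scores)
```

Scaling the full dataset before calling cross_val_score would let test-fold statistics leak into training; the pipeline keeps each fold honest.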
Quick Reference
| Parameter | Description | Example |
|---|---|---|
| estimator | Model to evaluate | LogisticRegression() |
| X | Input features | iris.data |
| y | Target labels | iris.target |
| cv | Number of folds or splitter | 5 or StratifiedKFold() |
| scoring | Metric to evaluate | 'accuracy', 'roc_auc' |
Key Takeaways
- Use cross_val_score from sklearn.model_selection to easily perform cross validation.
- Set cv to control how many folds your data is split into, commonly 5 or 10.
- Use stratified splits for classification to keep class proportions balanced.
- Choose the right scoring metric to match your problem goals.
- Avoid data leakage by not mixing training and test data before splitting.
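To see why the scoring choice matters, here is a sketch comparing several metrics on a binary classification task (the breast cancer dataset is used purely as an illustration, since 'roc_auc' and 'f1' require binary or properly configured targets):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary task, so 'roc_auc' and 'f1' are valid scorer names here
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200, random_state=42))

# The same model can rank differently under different metrics
for metric in ['accuracy', 'roc_auc', 'f1']:
    scores = cross_val_score(model, X, y, cv=5, scoring=metric)
    print(f'{metric}: {scores.mean():.3f}')
```

For imbalanced classes, accuracy alone can be misleading; metrics such as 'roc_auc' or 'f1' usually give a fuller picture.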