MLOps · How-To · Beginner · 3 min read

How to Use KFold Cross Validation in Python with sklearn

Use KFold from sklearn.model_selection to split your data into multiple train-test sets. Loop through the splits to train and test your model on different parts of the data, which helps measure model performance more reliably.
📐

Syntax

The KFold class is initialized with parameters to control the number of splits and shuffling. You then use its split() method to generate train and test indices for each fold.

  • n_splits: Number of folds (default 5).
  • shuffle: Whether to shuffle data before splitting (default False).
  • random_state: Seed for reproducible shuffling.
python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in kf.split(X):
    # train_index and test_index are arrays of indices
    pass
💻

Example

This example shows how to use KFold to split data, train a simple model on each fold, and print the accuracy scores.

python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Load sample data
X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=3, shuffle=True, random_state=1)

accuracies = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    print(f"Fold accuracy: {acc:.3f}")

print(f"Average accuracy: {np.mean(accuracies):.3f}")
Output
Fold accuracy: 1.000
Fold accuracy: 0.960
Fold accuracy: 0.960
Average accuracy: 0.973
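As a sanity check, the per-fold loop above can be collapsed into a single call with sklearn's cross_val_score, passing the KFold object as cv — a minimal sketch of the same evaluation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=3, shuffle=True, random_state=1)

# cross_val_score fits a fresh clone of the model on every fold,
# so no fitted state carries over between folds
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=kf)
print(f"Average accuracy: {scores.mean():.3f}")
```

With the same splits and model as above, this should reproduce the averaged score in one line.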
⚠️

Common Pitfalls

Common mistakes when using KFold include:

  • Not shuffling data before splitting, which can cause biased folds if the data is ordered (e.g. sorted by class).
  • Indexing mistakes that let the same rows appear in both the training and test sets of a fold.
  • Reusing one fitted model across folds, so learned state carries over and inflates scores.

Always shuffle if your data is ordered and create a new model instance for each fold.

python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # iris rows are sorted by class

kf = KFold(n_splits=3, shuffle=False)  # no shuffle: each fold is dominated by one class

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = LogisticRegression(max_iter=200)  # create a new model each fold
    model.fit(X_train, y_train)
    # ...
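Because iris is sorted by class, unshuffled folds are badly biased. For classification data, StratifiedKFold (a drop-in sibling of KFold) keeps each fold's class proportions close to those of the full dataset — a sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
import numpy as np

X, y = load_iris(return_X_y=True)  # 50 samples per class, sorted by class
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# split() also takes y here so it can balance the classes in every fold
for train_index, test_index in skf.split(X, y):
    per_class = np.bincount(y[test_index])
    print(per_class)  # roughly 16-17 test samples of each class per fold
```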
📊

Quick Reference

| Parameter | Description | Default |
| --- | --- | --- |
| n_splits | Number of folds to split data into | 5 |
| shuffle | Whether to shuffle data before splitting | False |
| random_state | Seed for reproducible shuffling | None |
| split(X) | Generates train/test indices for each fold | N/A |
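To see what split() actually yields, a tiny toy array makes the fold sizes concrete (with 10 samples and n_splits=5, each test fold holds 2 indices and the matching training fold holds the other 8):

```python
from sklearn.model_selection import KFold
import numpy as np

X = np.arange(10).reshape(-1, 1)  # toy data: 10 samples
kf = KFold(n_splits=5)

print(kf.get_n_splits(X))  # 5
for train_index, test_index in kf.split(X):
    # each fold: 8 training indices and 2 test indices, always disjoint
    print(train_index.shape, test_index.shape)
```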

Key Takeaways

  • Use KFold to split data into multiple train-test sets for reliable model evaluation.
  • Set shuffle=True to avoid biased splits if your data is ordered.
  • Create a new model instance inside the loop so no fitted state carries over between folds.
  • Use the split() method to get train and test indices for each fold.
  • Average the scores from all folds for a robust estimate of model performance.