How to Use KFold Cross Validation in Python with sklearn
Use KFold from sklearn.model_selection to split your data into multiple train-test sets. Loop over the splits to train and evaluate your model on different portions of the data, which gives a more reliable measure of model performance than a single train-test split.
Syntax
The KFold class is initialized with parameters to control the number of splits and shuffling. You then use its split() method to generate train and test indices for each fold.
- n_splits: Number of folds (default 5).
- shuffle: Whether to shuffle data before splitting (default False).
- random_state: Seed for reproducible shuffling.
```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in kf.split(X):
    # train_index and test_index are arrays of indices
    pass
```
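Because shuffle=True is paired with a fixed random_state, repeated calls to split() produce identical folds. A quick sketch to confirm this on a small array:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# collect the test indices from two separate calls to split()
first = [test.tolist() for _, test in kf.split(X)]
second = [test.tolist() for _, test in kf.split(X)]
print(first == second)  # True: the same random_state reproduces the folds
```

Leave random_state unset (or pass a different seed) and the shuffled folds will vary between runs.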
Example
This example shows how to use KFold to split data, train a simple model on each fold, and print the accuracy scores.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Load sample data
X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=3, shuffle=True, random_state=1)
accuracies = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    print(f"Fold accuracy: {acc:.3f}")

print(f"Average accuracy: {np.mean(accuracies):.3f}")
```
Output
Fold accuracy: 1.000
Fold accuracy: 0.960
Fold accuracy: 0.960
Average accuracy: 0.973
Common Pitfalls
Common mistakes when using KFold include:
- Not shuffling ordered data before splitting, which can produce biased folds (for example, a test fold containing only one class).
- Accidentally evaluating on training data, for example by indexing with train_index where test_index was intended.
- Reusing a fitted model across folds instead of creating a fresh instance, which can carry state between folds (for example, with warm_start=True).
Always shuffle if your data is ordered, and create a new model instance for each fold.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=3, shuffle=False)  # no shuffle can cause biased splits

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = LogisticRegression(max_iter=200)  # create a new model each fold
    model.fit(X_train, y_train)
    # ...
```
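To see why this matters for iris specifically: the labels come sorted by class (50 of each), so with shuffle=False each test fold contains a single class that the model never saw during training. A short check makes the bias visible:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
import numpy as np

X, y = load_iris(return_X_y=True)  # labels are sorted: 50 of each class

fold_classes = []
for train_index, test_index in KFold(n_splits=3, shuffle=False).split(X):
    classes = np.unique(y[test_index])
    fold_classes.append(classes)
    print("Test fold classes:", classes)  # each fold holds exactly one class
```

This prints [0], [1], and [2] in turn: every test fold is a class the training data excludes entirely, so the unshuffled evaluation is meaningless here.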
Quick Reference
| Parameter / Method | Description | Default |
|---|---|---|
| n_splits | Number of folds to split data into | 5 |
| shuffle | Whether to shuffle data before splitting | False |
| random_state | Seed for reproducible shuffling | None |
| split(X) | Generates train/test indices for each fold | N/A |
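When you only need per-fold scores, sklearn's cross_val_score accepts a KFold object through its cv parameter and runs the fit-and-score loop for you. A minimal sketch using the same iris setup as the example above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=3, shuffle=True, random_state=1)

# one accuracy score per fold; a fresh clone of the estimator
# is fitted on each fold, avoiding carried-over state
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=kf)
print(f"Average accuracy: {np.mean(scores):.3f}")
```

With the same n_splits, shuffle, and random_state, this should reproduce the manual loop's fold scores.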
Key Takeaways
- Use KFold to split data into multiple train-test sets for reliable model evaluation.
- Set shuffle=True to avoid biased splits if your data is ordered.
- Create a new model instance inside the loop for each fold so no state carries between folds.
- Use the split() method to get train and test indices for each fold.
- Average the scores from all folds to get a robust estimate of model performance.