MLOps · How-To · Beginner · 4 min read

How to Use StratifiedKFold in sklearn with Python

Use StratifiedKFold from sklearn.model_selection to split your dataset into folds that preserve the percentage of samples for each class. Initialize it with the number of splits, then call split(X, y) to get train and test indices for each fold.
📐

Syntax

The basic syntax for StratifiedKFold, with its key parameters:

  • n_splits: Number of folds to create.
  • shuffle: Whether to shuffle the data before splitting (default is False).
  • random_state: Seed for reproducible shuffling (only used when shuffle=True).
  • split(X, y): Generates train/test indices for each fold, where X is the feature array and y is the target labels.
```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```
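A StratifiedKFold instance can also be passed directly to sklearn's cross-validation helpers via their cv parameter, so you rarely need to write the fold loop yourself. A minimal sketch (the toy data and the LogisticRegression model are illustrative choices, not prescribed by the API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy dataset: one feature, imbalanced binary target (6 zeros, 14 ones)
X = np.arange(20).reshape(20, 1)
y = np.array([0] * 6 + [1] * 14)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score fits and scores the model once per fold defined by skf
scores = cross_val_score(LogisticRegression(), X, y, cv=skf)
print(len(scores))  # one accuracy score per fold
```

Because the splitter is stratified, every training fold is guaranteed to contain both classes, which avoids fit errors on small minority classes.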
💻

Example

This example shows how to use StratifiedKFold to split a small dataset with imbalanced classes into 3 folds, preserving class proportions in each fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Sample data: 10 samples with an imbalanced binary target (4 zeros, 6 ones)
X = np.arange(10).reshape((10, 1))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# shuffle is left at its default (False) so the output below is deterministic
skf = StratifiedKFold(n_splits=3)

for fold, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold}")
    print("Train indices:", train_index)
    print("Test indices:", test_index)
    print("Train class distribution:", np.bincount(y[train_index]))
    print("Test class distribution:", np.bincount(y[test_index]))
    print()
```
Output
Fold 1
Train indices: [2 3 6 7 8 9]
Test indices: [0 1 4 5]
Train class distribution: [2 4]
Test class distribution: [2 2]

Fold 2
Train indices: [0 1 3 4 5 8 9]
Test indices: [2 6 7]
Train class distribution: [3 4]
Test class distribution: [1 2]

Fold 3
Train indices: [0 1 2 4 5 6 7]
Test indices: [3 8 9]
Train class distribution: [3 4]
Test class distribution: [1 2]
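To see why stratification matters, compare against plain KFold on the same class-sorted data: KFold slices the data into contiguous chunks and ignores y, so its first test fold contains only class 0, while StratifiedKFold keeps both classes in every fold. A small sketch:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(10).reshape(10, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # sorted by class

# Plain KFold ignores y: contiguous slices give single-class test folds
kf_test_counts = [np.bincount(y[test], minlength=2)
                  for _, test in KFold(n_splits=3).split(X)]
print(kf_test_counts)

# StratifiedKFold keeps both classes in every test fold
skf_test_counts = [np.bincount(y[test], minlength=2)
                   for _, test in StratifiedKFold(n_splits=3).split(X, y)]
print(skf_test_counts)
```

A model evaluated with the plain KFold splits above would be tested on folds that contain no positive samples at all, making the scores meaningless for the minority class.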
⚠️

Common Pitfalls

Not using the target labels y in split(): StratifiedKFold needs the target array y to compute class proportions; calling split(X) without it raises a TypeError.

Using shuffle=False with ordered data: stratification still preserves class ratios, but without shuffling each class's samples are assigned to folds in dataset order, so any ordering in your data (e.g., by time or source) carries into the folds. Set shuffle=True with a random_state when order matters.

Incorrect indexing of data arrays: split() yields positional indices as numpy arrays. Plain Python lists cannot be indexed this way, and pandas DataFrames need .iloc; numpy arrays work directly.

```python
from sklearn.model_selection import StratifiedKFold
import numpy as np

X = np.arange(6).reshape((6, 1))
y = np.array([0, 0, 1, 1, 1, 0])

skf = StratifiedKFold(n_splits=2)

# Wrong: missing y in split
try:
    for train_index, test_index in skf.split(X):
        pass
except TypeError as e:
    print(f"Error: {e}")

# Right: include y
for train_index, test_index in skf.split(X, y):
    print("Train indices:", train_index)
    print("Test indices:", test_index)
```
Output
Error: split() missing 1 required positional argument: 'y'
Train indices: [3 4 5]
Test indices: [0 1 2]
Train indices: [0 1 2]
Test indices: [3 4 5]
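The indices yielded by split() are positional, so pandas objects need .iloc rather than plain [] indexing, which would try to select columns. A short sketch (the DataFrame here is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.DataFrame({"feature": range(6), "label": [0, 0, 1, 1, 1, 0]})
X, y = df[["feature"]], df["label"]

skf = StratifiedKFold(n_splits=2)

for train_index, test_index in skf.split(X, y):
    # .iloc selects rows by position; X[train_index] would raise a KeyError
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print(len(X_train), len(X_test))
```

The same .iloc pattern works for Series targets, so features and labels stay aligned fold by fold.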
📊

Quick Reference

StratifiedKFold Cheat Sheet:

  • n_splits: Number of folds (default 5)
  • shuffle: Shuffle data before splitting (default False)
  • random_state: Seed for reproducible shuffling
  • split(X, y): Generate train/test indices preserving class ratios

Use with classification tasks to keep class balance in folds.

Key Takeaways

  • StratifiedKFold splits data into folds that keep class proportions consistent.
  • Always pass both features X and target y to the split() method.
  • Use shuffle=True with random_state for randomized but reproducible splits.
  • It is ideal for classification tasks with imbalanced classes.
  • Check that your data supports positional indexing by numpy arrays (numpy arrays directly, pandas via .iloc).