
Stratified K-fold in ML Python

Introduction
Stratified K-fold splits data into K parts (folds) while preserving each class's proportion in every fold, so the model trains and tests on a representative mix of all groups.
Use it when:
When your classes are imbalanced and each fold should keep the same class ratios.
When you want to evaluate your model fairly on all types of data.
When your dataset is small and every sample should be used for both training and testing.
When you want to avoid bias from uneven class distribution in the splits.
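To see the stratification at work, here is a small sketch (using a made-up imbalanced dataset, not part of the original examples) that checks every test fold keeps the overall 4:1 class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced dataset: 80 samples of class 0, 20 of class 1 (a 4:1 ratio)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many class-0 and class-1 samples land in each test fold
    fold_counts.append(list(np.bincount(y[test_idx])))

print(fold_counts)  # every fold keeps the 4:1 ratio: 16 of class 0, 4 of class 1
```

With 100 samples and 5 splits, each test fold holds 20 samples, and stratification guarantees they split 16/4 between the two classes.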
Syntax
ML Python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Here X is the feature matrix and y holds the class labels you want to keep balanced across folds.
shuffle=True randomizes the sample order before splitting; set random_state to make the shuffle reproducible.
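The skeleton above can be filled out into a complete cross-validation loop. A minimal sketch on the iris data, where the choice of LogisticRegression is just an illustration, not part of the syntax itself:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Fit a fresh model on each fold and record its test accuracy
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print(f'Mean accuracy: {np.mean(scores):.3f}')
```

Because every fold sees the same class mix, the per-fold accuracies are comparable and their mean is a fair estimate of overall performance.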
Examples
Splits data into 3 parts keeping class balance.
ML Python
skf = StratifiedKFold(n_splits=3)
for train_idx, test_idx in skf.split(X, y):
    print('Train:', train_idx, 'Test:', test_idx)
Shuffles the data before splitting it into 4 parts.
ML Python
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
for train_idx, test_idx in skf.split(X, y):
    print(f'Train size: {len(train_idx)}, Test size: {len(test_idx)}')
Sample Program
This program splits the iris dataset into 3 folds with StratifiedKFold and prints the per-class sample counts in the train and test sets of each fold, showing that the class balance is preserved.
ML Python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
import numpy as np

# Load iris data
data = load_iris()
X = data.data
y = data.target

# Create StratifiedKFold with 3 splits
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

fold = 1
for train_index, test_index in skf.split(X, y):
    y_train, y_test = y[train_index], y[test_index]
    # Count classes in train and test
    unique_train, counts_train = np.unique(y_train, return_counts=True)
    unique_test, counts_test = np.unique(y_test, return_counts=True)
    print(f'Fold {fold}')
    print('Train class distribution:', dict(zip(unique_train, counts_train)))
    print('Test class distribution:', dict(zip(unique_test, counts_test)))
    fold += 1
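When you only need per-fold scores rather than the indices themselves, the StratifiedKFold object can be passed straight to cross_val_score via its cv parameter. A sketch, again using iris and an illustrative DecisionTreeClassifier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# cross_val_score runs the whole fit/score loop using our stratified splits
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=skf)
print(scores.mean())
```

This is equivalent to writing the fold loop by hand, just shorter.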
Important Notes
StratifiedKFold is best for classification tasks where classes must be balanced.
StratifiedKFold expects discrete class labels; passing a continuous regression target raises an error, so bin the target first if you need stratified splits for regression.
Always set random_state for reproducible splits.
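For a regression target, one common workaround (a sketch, not from the original) is to cut the continuous values into quantile bins and stratify on the bins:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_continuous = rng.normal(size=200)  # a continuous target

# Cut the target at its 20/40/60/80% quantiles to get 5 equal-sized bins
edges = np.quantile(y_continuous, [0.2, 0.4, 0.6, 0.8])
y_binned = np.digitize(y_continuous, edges)

# Stratify on the bins instead of the raw continuous values
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
test_bin_counts = []
for train_idx, test_idx in skf.split(X, y_binned):
    test_bin_counts.append(list(np.bincount(y_binned[test_idx], minlength=5)))

print(test_bin_counts)  # each test fold draws evenly from every bin
```

You still train and evaluate on the continuous y_continuous; y_binned exists only to drive the splitting.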
Summary
Stratified K-fold splits data into parts keeping class balance in each part.
It helps models learn and test fairly on all classes.
Use it especially when classes are uneven or data is small.