ML Python programming · ~5 mins

Cross-validation (K-fold) in ML Python

Introduction

Cross-validation estimates how well a model will perform on new data. It splits the dataset into several parts (folds) and trains and tests the model multiple times, using a different fold as the test set each time.

Use K-fold cross-validation:

When you want to know whether your model will generalize to unseen data.
When you have limited data and want to use all of it for both training and testing.
When you want to compare different models fairly.
When tuning model settings (hyperparameters) to avoid overfitting.
When you want a more reliable estimate of model performance than a single train-test split.
Syntax
ML Python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test your model here

n_splits is the number of folds to split the data into.

shuffle=True shuffles the data before splitting, which matters when the rows are ordered.

random_state fixes the shuffle so the same splits are produced on every run.
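To make the index mechanics concrete, here is a minimal, self-contained sketch (the 5-sample array is just illustrative data):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(5, 2)  # 5 toy samples, 2 features each

kf = KFold(n_splits=5)  # no shuffle: folds come in order
folds = list(kf.split(X))

# Each fold yields (train_indices, test_indices); with 5 samples and
# 5 splits, each sample is in the test set exactly once.
train_idx, test_idx = folds[0]
print(train_idx, test_idx)  # → [1 2 3 4] [0]
```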

Examples
This splits the data into 3 parts and prints the sizes of the train and test sets for each fold.
ML Python
kf = KFold(n_splits=3)
for train_idx, test_idx in kf.split(X):
    print("Train size:", len(train_idx), "Test size:", len(test_idx))
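When the number of samples does not divide evenly by n_splits, the first folds get one extra test sample. A self-contained sketch with 10 toy samples:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples

kf = KFold(n_splits=3)
# 10 samples / 3 folds: test sizes are 4, 3, 3
sizes = [(len(train), len(test)) for train, test in kf.split(X)]
print(sizes)  # → [(6, 4), (7, 3), (7, 3)]
```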
This splits the data into 4 parts with shuffling for randomness and prints the train and test indices for each fold.
ML Python
kf = KFold(n_splits=4, shuffle=True, random_state=1)
for train_idx, test_idx in kf.split(X):
    print(train_idx, test_idx)
Sample Program

This program uses 5-fold cross-validation on the Iris dataset with logistic regression. It prints accuracy for each fold and the average accuracy.

ML Python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Prepare KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)

accuracies = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Create and train model
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    # Predict and evaluate
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)

# Print average accuracy
print(f"Accuracies for each fold: {accuracies}")
print(f"Average accuracy: {np.mean(accuracies):.3f}")
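As a side note, scikit-learn's cross_val_score helper can replace the manual loop above. A minimal sketch using the same KFold settings and model:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score runs the fit/predict loop internally and
# returns one score (accuracy, for classifiers) per fold
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=kf)
print("Accuracies for each fold:", scores)
print("Average accuracy:", round(np.mean(scores), 3))
```

Use the explicit loop when you need per-fold control (custom metrics, inspecting predictions); use cross_val_score for the common case.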
Important Notes

Using shuffle=True is important to get random splits, especially if the data is ordered (the Iris dataset, for example, is sorted by class).

K-fold cross-validation gives a better idea of model performance than a single train-test split.

A higher n_splits gives each model more training data and a more stable estimate, but requires more training rounds and therefore more time.
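For classification, plain KFold can produce folds with skewed class proportions; StratifiedKFold (also in sklearn.model_selection) keeps the class ratio in every fold. A minimal sketch with made-up imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative imbalanced labels: 8 samples of class 0, 4 of class 1
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # Every test fold keeps the 2:1 class ratio: 2 of class 0, 1 of class 1
    print(np.bincount(y[test_idx]))  # → [2 1]
```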

Summary

K-fold cross-validation splits the data into folds so the model can be trained and tested multiple times.

It helps assess a model's ability to generalize to new data.

Use it to get a reliable performance estimate and to avoid overfitting when tuning a model.