Cross-validation helps estimate how well a model will perform on new data. It splits the data into parts so the model can be trained and tested multiple times.
Cross-Validation (K-Fold) in Python
Introduction
K-fold cross-validation is useful in situations like these:
When you want to know if your model will work well on unseen data.
When you have limited data and want to use it efficiently for training and testing.
When you want to compare different models fairly.
When tuning model settings to avoid overfitting.
When you want a more reliable estimate of model performance than a single train-test split.
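The situations above can be handled with very little code: scikit-learn's cross_val_score runs the whole split-train-score loop for you. A minimal sketch using the built-in Iris dataset (note that for classifiers, scikit-learn stratifies the folds by default when cv is an integer):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# One call runs 5-fold cross-validation and returns one score per fold
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=5)
print(scores)          # five accuracy values, one per fold
print(scores.mean())   # a more reliable estimate than a single split
```

Using the mean of the five fold scores smooths out the luck of any single train-test split.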
Syntax
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test your model here
n_splits sets how many folds (parts) to split the data into.
shuffle=True shuffles the rows before splitting so the folds are random; random_state makes that shuffle reproducible.
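To see what these parameters do, print the test fold of each split for a tiny dataset (a small sketch; the shuffled indices depend on random_state):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6)  # six samples: 0..5

# Without shuffle, the folds are consecutive blocks
for train_idx, test_idx in KFold(n_splits=3).split(X):
    print("test:", test_idx)   # [0 1], then [2 3], then [4 5]

# With shuffle, the blocks are randomized; random_state fixes the result
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=42).split(X):
    print("test:", test_idx)
```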
Examples
This splits the data into 3 folds and prints the sizes of the train and test sets for each fold.
kf = KFold(n_splits=3)
for train_idx, test_idx in kf.split(X):
    print("Train size:", len(train_idx), "Test size:", len(test_idx))
This splits the data into 4 folds with shuffling for randomness and prints the train and test indices for each fold.
kf = KFold(n_splits=4, shuffle=True, random_state=1)
for train_idx, test_idx in kf.split(X):
    print(train_idx, test_idx)
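A useful property to notice in the printed indices: across all folds, every sample lands in the test set exactly once. A small self-contained check (using a hypothetical 12-sample array for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12)
kf = KFold(n_splits=4, shuffle=True, random_state=1)

# Collect the test indices from every fold
all_test = np.concatenate([test_idx for _, test_idx in kf.split(X)])

# Every sample index 0..11 appears exactly once across the test folds
print(np.sort(all_test))
```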
Sample Program
This program uses 5-fold cross-validation on the Iris dataset with logistic regression. It prints accuracy for each fold and the average accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Prepare KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
accuracies = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Create and train model
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    # Predict and evaluate
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)

# Print per-fold and average accuracy
print(f"Accuracies for each fold: {accuracies}")
print(f"Average accuracy: {np.mean(accuracies):.3f}")
Important Notes
Using shuffle=True is important for getting random splits, especially when the data is ordered (for example, sorted by class label).
K-fold cross-validation gives a better idea of model performance than a single train-test split.
A higher n_splits means more training rounds and therefore more computation time, but each model trains on a larger share of the data.
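To illustrate the point about reliability, you can compare a single train-test split with a 5-fold average on the same data (a sketch; the exact numbers depend on the random seeds chosen here):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Single split: one number, sensitive to which rows land in the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
single = accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))

# 5-fold: five numbers averaged, less dependent on one particular split
kf = KFold(n_splits=5, shuffle=True, random_state=0)
folds = [
    accuracy_score(y[te], model.fit(X[tr], y[tr]).predict(X[te]))
    for tr, te in kf.split(X)
]
print(f"single split: {single:.3f}, 5-fold mean: {np.mean(folds):.3f}")
```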
Summary
K-fold cross-validation splits the data into K folds and tests the model K times, each time holding out a different fold.
It helps check the model's ability to generalize to new data.
Use it for a more reliable performance estimate and to avoid overfitting to a single split.