What is Train-test split in ML Python?

Introduction

We split data into a training part, which the model learns from, and a testing part, which shows how well the model performs on data it has not seen before.

Typical situations for using a train-test split:

When you want to check whether your model can predict new data well.

Before training a model, so you avoid "cheating" by testing on the same data the model learned from.

When you have a dataset and want to measure model accuracy honestly.

When you want to compare different models fairly using the same test data.

When you are tuning model settings and need a reliable way to see whether a change actually helps.

Syntax

ML Python

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

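X holds your input features and y holds the target labels.

test_size sets the fraction of data used for testing (e.g., 0.25 means 25%).

random_state fixes the random shuffle so that the split is the same on every run.

The function returns the four pieces in the order X_train, X_test, y_train, y_test.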
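As a rough illustration of what the call returns, the sketch below splits a made-up array of 100 samples and prints the shapes of the four pieces. The arrays X and y here are placeholder data invented for this example, not part of any real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption for illustration): 100 samples, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2  # alternating 0/1 labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 25% of 100 samples go to the test set, the rest to the training set
print(X_train.shape, X_test.shape)  # (75, 2) (25, 2)
print(y_train.shape, y_test.shape)  # (75,) (25,)
```

The features and labels are split along the same rows, so each row of X_train still lines up with the matching entry in y_train.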

Examples

Splits data so 20% is for testing and 80% for training.

ML Python

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Splits data with 30% test size and sets random_state for reproducible splits.

ML Python

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
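One way to see what "reproducible" means here is to run the split twice with the same random_state and compare the results. This is a minimal sketch on placeholder arrays (X and y are made up for the example).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption for illustration): 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two separate calls that use the same random_state
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=0)

# Both calls pick exactly the same rows for the test set
same_split = bool(
    np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
)
print(same_split)  # True
```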

Splits the data in half without shuffling, so the first half becomes the training set and the second half becomes the test set.

ML Python

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)
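To make the effect of shuffle=False visible, the sketch below uses labels that are simply the numbers 0 to 9 (placeholder data chosen for this example), so the original order is easy to read off the output.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption for illustration): labels 0..9 in order
X = np.arange(10).reshape(10, 1)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, shuffle=False
)

# Without shuffling, the original order is kept:
# the first rows go to training, the last rows go to testing
print(y_train)  # [0 1 2 3 4]
print(y_test)   # [5 6 7 8 9]
```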

Sample Program

This program loads iris flower data, splits it into training and testing parts, trains a decision tree, and shows how well it predicts unseen data.

ML Python

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load example data
iris = load_iris()
X, y = iris.data, iris.target

# Split data: 30% test, 70% train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create and train model
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)

# Predict on test data
predictions = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)

print(f"Test set size: {len(X_test)} samples")
print(f"Accuracy on test set: {accuracy:.2f}")
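The same split can also be reused to compare models fairly, since every model is then scored on identical test data. The sketch below is one possible way to do that; the choice of KNeighborsClassifier as the second model is an arbitrary assumption for illustration, and no particular scores are implied.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

# One split, shared by every model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Second model is an arbitrary choice for this sketch
models = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "k-nearest neighbors": KNeighborsClassifier(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # train on the shared training set
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2f}")
```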


Important Notes

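Set random_state to a fixed integer to get the same split every time you run the code.

Shuffling before splitting (the default) mixes the data so both parts are representative; turn it off with shuffle=False when the order matters, for example with time-ordered data.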

Test size depends on how much data you have; common values are 0.2 or 0.3.
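To get a feel for what these fractions mean in sample counts, the sketch below applies both common values to a placeholder dataset of 1000 rows (the data is made up for this example).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption for illustration): 1000 samples
X = np.arange(1000).reshape(1000, 1)
y = np.arange(1000)

sizes = {}
for test_size in (0.2, 0.3):
    X_train, X_test, _, _ = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    # Record how many samples landed in each part
    sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: {len(X_train)} train, {len(X_test)} test")
```

A larger test_size gives a steadier accuracy estimate but leaves fewer samples for training, which matters most when the dataset is small.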

Summary

Train-test split helps check model performance on new data.

Use sklearn's train_test_split to easily split data.

Set test_size and random_state for control and reproducibility.