
Train-test split in ML Python

Introduction
We split data into a training part, used to fit the model, and a testing part, used to check how well the model performs on data it has not seen.
When you want to check if your model can predict new data well.
Before training a model to avoid cheating by testing on the same data it learned from.
When you have a dataset and want to measure model accuracy honestly.
To compare different models fairly using the same test data.
When tuning model settings and needing a reliable way to see improvements.
Syntax
ML Python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X is your input features, y is the target labels.
test_size sets the fraction of data for testing (e.g., 0.25 means 25%).
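To make the effect of test_size concrete, here is a minimal sketch on toy data. The arrays are made up for illustration (20 samples, 2 features) and are not part of the lesson's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 20 samples with 2 features each.
# Row i of X is [2*i, 2*i + 1] and its label is i.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # (15, 2) (5, 2)
print(y_train.shape, y_test.shape)  # (15,) (5,)
```

With 20 samples and test_size=0.25, 5 samples go to the test set and the remaining 15 to the training set. Each row of X stays paired with its label in y.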
Examples
Splits data so 20% is for testing and 80% for training.
ML Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Splits data with 30% test size and sets random_state for reproducible splits.
ML Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
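A quick way to see what "reproducible" means here is to split the same data twice with the same random_state and compare the results. The toy arrays below are an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two separate calls with the same random_state
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=0)

# Both calls picked exactly the same test samples
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(same_split)  # True
```

Without random_state, each call draws a new random split, so results can change from run to run.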
Splits the data in half without shuffling, so the first half becomes the training set and the second half becomes the test set.
ML Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)
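To check that shuffle=False keeps the original order, a small sketch with numbered labels helps. The toy arrays are an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10 samples labeled 0..9 in order
X = np.arange(10).reshape(10, 1)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, shuffle=False
)

print(y_train)  # [0 1 2 3 4]
print(y_test)   # [5 6 7 8 9]
```

The training set is the first block of rows and the test set is the last block, in their original order.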
Sample Program
This program loads iris flower data, splits it into training and testing parts, trains a decision tree, and shows how well it predicts unseen data.
ML Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load example data
iris = load_iris()
X, y = iris.data, iris.target

# Split data: 30% test, 70% train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create and train model
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)

# Predict on test data
predictions = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)

print(f"Test set size: {len(X_test)} samples")
print(f"Accuracy on test set: {accuracy:.2f}")
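The introduction mentions comparing models fairly on the same test data. One way to do that is sketched below: split once, then fit and score each model on that single split. The choice of KNeighborsClassifier as the second model is an assumption made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split once so every model sees the same training and test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Second model is an illustrative choice, not part of the original program
models = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "k-nearest neighbors": KNeighborsClassifier(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2f}")
```

Because both models are scored on the same 45 test samples, a difference in accuracy comes from the models and not from a different split.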
Important Notes
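Set random_state to an integer to get the same split every time you run the code.
Shuffling is on by default and mixes the rows before splitting, which matters when the data is sorted (for example, by class). Set shuffle=False only when order matters, such as with time series.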
Test size depends on how much data you have; common values are 0.2 or 0.3.
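The note on shuffling can be seen on the iris data itself, where the rows are stored sorted by class. A minimal sketch comparing the classes that land in the test set with and without shuffling:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 rows: 50 of class 0, then 50 of class 1, then 50 of class 2

# No shuffling: the test set is simply the last 45 rows
_, _, _, y_test_ordered = train_test_split(X, y, test_size=0.3, shuffle=False)
print(np.unique(y_test_ordered))  # [2]

# Default shuffling: the test set is drawn from across the whole dataset
_, _, _, y_test_shuffled = train_test_split(X, y, test_size=0.3, random_state=1)
print(np.unique(y_test_shuffled))
```

Without shuffling, the test set contains only one flower class, so the accuracy would say nothing about the other two.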
Summary
Train-test split helps check model performance on new data.
Use sklearn's train_test_split to easily split data.
Set test_size and random_state for control and reproducibility.