How to split data into train and test in Python

How to Split Data into Train and Test Sets in Python with sklearn

Use train_test_split from sklearn.model_selection to split your data into training and testing sets. It randomly divides arrays or matrices into subsets. Set the test set's share with test_size, and pass random_state to make the split reproducible.

📐

Syntax

The train_test_split function splits arrays or matrices into random train and test subsets.

arrays: One or more arrays of equal length to split together (e.g., features and labels).
test_size: A float between 0 and 1 (fraction of samples) or an int (number of samples) for the test set.
train_size: The same, for the train set (optional; defaults to the complement of test_size).
random_state: Seed for the random number generator, used to get reproducible splits.
shuffle: Whether to shuffle data before splitting (default is True).

python

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
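As a minimal sketch of how these parameters behave, the snippet below compares a fractional test_size with an integer one and shows what shuffle=False does. The toy arrays and the variable names are illustrative assumptions, not part of the original syntax block.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative): 10 samples, 2 features each
X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# A float test_size is a fraction of the data: 0.2 * 10 samples = 2 test rows
X_tr_frac, X_te_frac, y_tr_frac, y_te_frac = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An int test_size is an absolute number of test rows
X_tr_cnt, X_te_cnt, y_tr_cnt, y_te_cnt = train_test_split(
    X, y, test_size=2, random_state=42
)
print(len(X_te_frac), len(X_te_cnt))  # 2 2

# shuffle=False keeps the original order: the last rows become the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)
print(y_tr, y_te)  # [0 1 2 3 4 5 6 7] [8 9]
```

With the same random_state and the same resulting sizes, the fractional and integer calls select the same rows.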

💻

Example

This example splits a simple dataset of features and labels into roughly 75% training and 25% testing data. With 10 samples, the 25% test share is rounded up to 3 samples, leaving 7 for training. Fixing random_state makes the split identical on every run.

python

from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: 10 samples, 2 features each
X = np.arange(20).reshape((10, 2))
# Labels for each sample
y = np.arange(10)

# Split data: 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

print('X_train:\n', X_train)
print('X_test:\n', X_test)
print('y_train:', y_train)
print('y_test:', y_test)

Output

X_train:
 [[ 8  9]
 [ 0  1]
 [ 6  7]
 [ 2  3]
 [14 15]
 [16 17]
 [10 11]]
X_test:
 [[ 4  5]
 [18 19]
 [12 13]]
y_train: [4 0 3 1 7 8 5]
y_test: [2 9 6]
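To check the reproducibility claim rather than take it on trust, one option is to run the same split twice and compare the results. The sketch below does that on the same toy data and also checks the shapes and the row-to-label alignment. The helper names (first, second, reproducible, aligned) are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))
y = np.arange(10)

first = train_test_split(X, y, test_size=0.25, random_state=1)
second = train_test_split(X, y, test_size=0.25, random_state=1)

# Same seed and same inputs: all four returned arrays match
reproducible = all(np.array_equal(a, b) for a, b in zip(first, second))
print(reproducible)  # True

X_train, X_test, y_train, y_test = first
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)

# In this toy data, row i is [2*i, 2*i + 1], so alignment is easy to verify
aligned = (
    np.array_equal(X_train[:, 0], 2 * y_train)
    and np.array_equal(X_test[:, 0], 2 * y_test)
)
print(aligned)  # True
```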

⚠️

Common Pitfalls

Common mistakes when splitting data include:

Not setting random_state, which makes the split different on each run and hard to reproduce.
Using an invalid test_size value: a float must lie strictly between 0 and 1, and an int is read as a number of samples, so 2 means two samples, not 200%.
Turning off shuffling on data stored in a meaningful order (for example, sorted by label), which can produce biased splits.
Splitting features and labels in separate train_test_split calls, which misaligns rows and labels.

python

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# Wrong: splitting features and labels separately (misaligned)
X_train_wrong, X_test_wrong = train_test_split(X, test_size=0.4)
y_train_wrong, y_test_wrong = train_test_split(y, test_size=0.4)

# Right: split together to keep alignment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print('Wrong X_train:', X_train_wrong)
print('Wrong y_train:', y_train_wrong)
print('Correct X_train:', X_train)
print('Correct y_train:', y_train)

Output

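Wrong X_train: [[4 5]
 [0 1]
 [2 3]]
Wrong y_train: [1 3 4]
Correct X_train: [[2 3]
 [6 7]
 [8 9]]
Correct y_train: [1 3 4]

The two "Wrong" lines change on every run because those calls set no random_state. In this sample run, rows [4 5], [0 1], [2 3] (labels 2, 0, 1) ended up paired with labels 1, 3, 4. In the correct split, each label i stays with its row [2i, 2i+1].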
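The other pitfalls can be reproduced in a few lines. This is a small sketch on made-up data: labels sorted by class to show the effect of shuffle=False, and a float test_size above 1 to trigger the validation error. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))
# Labels sorted by class: five 0s followed by five 1s
y = np.array([0] * 5 + [1] * 5)

# Without shuffling, the test set is simply the last rows,
# so it contains only class 1
_, _, y_train_ns, y_test_ns = train_test_split(
    X, y, test_size=0.3, shuffle=False
)
print(y_train_ns)  # [0 0 0 0 0 1 1]
print(y_test_ns)   # [1 1 1]

# A float test_size outside (0, 1) is rejected with a ValueError
raised = False
try:
    train_test_split(X, y, test_size=1.5)
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

With shuffling left on (the default), both classes would normally appear in the test set, although a small random sample can still be unbalanced.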

📊

Quick Reference

Parameter	Description	Default
arrays	Data arrays to split (features, labels)	Required
test_size	Proportion or number of test samples	None (0.25 if train_size is also None)
train_size	Proportion or number of train samples	None (complement of test_size)
random_state	Seed for reproducible splits	None
shuffle	Shuffle data before splitting	True
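The defaults in the table can be checked directly. The sketch below, on illustrative toy data, calls train_test_split with no sizes, with only train_size, and with both sizes set so that they do not add up to the full dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# No sizes given: the test set defaults to 25% (rounded up to 3 of 10)
Xa_train, Xa_test, ya_train, ya_test = train_test_split(X, y, random_state=0)
print(len(Xa_train), len(Xa_test))  # 7 3

# Only train_size given: the test set is the complement
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X, y, train_size=0.5, random_state=0
)
print(len(Xb_train), len(Xb_test))  # 5 5

# Both given: samples outside the two shares are left out of the split
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X, y, train_size=0.5, test_size=0.2, random_state=0
)
print(len(Xc_train), len(Xc_test))  # 5 2
```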

✅

Key Takeaways

Use sklearn's train_test_split to split features and labels together for aligned data.

Set random_state to get the same train-test split every time you run your code.

Specify test_size as a fraction (e.g., 0.25) to control test set size.

Avoid splitting features and labels separately to prevent misalignment.

Keep shuffle=True (the default) unless the original order must be preserved.