How to Use train_test_split in sklearn with Python

Use train_test_split from sklearn.model_selection to split your data arrays into random train and test subsets. Pass your features and labels, specify test_size or train_size, and get back four arrays: training features, testing features, training labels, and testing labels.

📐

Syntax

The train_test_split function splits arrays or matrices into random train and test subsets.

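X: Feature data (input variables), passed as the first positional array.
y: Labels or target data, passed as the second positional array. Any number of arrays with the same number of samples can be passed and are split the same way.
test_size: A float between 0 and 1 is the fraction of samples for the test set; an integer is the absolute number of test samples.
train_size: Same rules as test_size, applied to the train set (optional). If omitted, it is the complement of test_size.
random_state: Integer seed for the random number generator, so that the shuffle (and therefore the split) is reproducible.
shuffle: Whether to shuffle data before splitting (default is True).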

python

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True)
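The parameters above can be hard to picture from the signature alone, so here is a minimal sketch of how test_size and shuffle change the result. The 20-sample toy arrays and the variable names (frac_sizes, count_sizes, ordered_test_labels) are illustrative assumptions, not part of the original article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 20 samples, 2 features each
X = np.arange(40).reshape((20, 2))
y = np.arange(20)

# Float test_size: a fraction of the data (0.25 * 20 = 5 test samples)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
frac_sizes = (len(X_train), len(X_test))

# Integer test_size: an absolute number of test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=4, random_state=42
)
count_sizes = (len(X_train), len(X_test))

# shuffle=False keeps the original order, so the test set is the last rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)
ordered_test_labels = y_test.tolist()

print(frac_sizes)           # (15, 5)
print(count_sizes)          # (16, 4)
print(ordered_test_labels)  # [15, 16, 17, 18, 19]
```

Turning shuffling off is usually only appropriate when the order of the samples carries meaning, for example in time-ordered data.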

💻

Example

This example shows how to split a simple dataset of features and labels into training and testing sets using train_test_split. It prints the shapes of the resulting arrays to confirm the split.

python

from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: 10 samples, 2 features each
X = np.arange(20).reshape((10, 2))
# Labels: 10 samples
y = np.arange(10)

# Split data: 30% test, 70% train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

Output

X_train shape: (7, 2)
X_test shape: (3, 2)
y_train shape: (7,)
y_test shape: (3,)
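Shapes confirm the sizes, but not that the split is repeatable or that rows and labels stay paired. The following sketch checks both on the same toy data; the names same_split and aligned are hypothetical, and the alignment check relies on the way this particular dataset was built.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# Two independent calls with the same seed
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=1)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=1)

# The same random_state should give the same test subset both times
same_split = bool(
    np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
)

# In this toy data, row i of X is [2*i, 2*i + 1] and y[i] == i,
# so each selected row's first feature must equal twice its label
aligned = bool(np.array_equal(X_test_a[:, 0], 2 * y_test_a))

print(same_split, aligned)  # True True
```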

⚠️

Common Pitfalls

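Not setting random_state gives a different split on each run, which makes results hard to reproduce.
Splitting features and labels in separate calls, or splitting only the features, breaks the pairing between inputs and targets. Pass both arrays to a single call.
Passing float test_size and train_size values that sum to more than 1.0 raises a ValueError. An integer is read as a number of samples, so test_size=1 means one sample, not 100% of the data.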

python

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# Wrong: only X is split, so y has no matching train/test subsets
X_train, X_test = train_test_split(X, test_size=0.4)

# Right: split both X and y together
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
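The size-related pitfall can be illustrated with a short sketch. It assumes a 20-sample toy dataset; the names sizes_error and subset_sizes are hypothetical, and the exact wording of the error message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 20 samples, 2 features each
X = np.arange(40).reshape((20, 2))
y = np.arange(20)

# Fractions that add up to more than 1.0 cannot both be satisfied
try:
    train_test_split(X, y, train_size=0.8, test_size=0.3)
    sizes_error = None
except ValueError as exc:
    sizes_error = str(exc)

print(sizes_error is not None)  # True

# Fractions that add up to less than 1.0 are accepted;
# the samples that fit in neither subset are simply left out
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, test_size=0.25, random_state=0
)
subset_sizes = (len(X_train), len(X_test))

print(subset_sizes)  # (10, 5)
```

In most cases it is simpler to set only one of the two parameters and let the other default to its complement.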

📊

Quick Reference

Parameter	Description	Default
X	Input features array or matrix	Required (first positional array)
y	Target labels array	Optional (second positional array)
test_size	Fraction (float) or number (int) of test samples	None (0.25 if train_size is also None)
train_size	Fraction (float) or number (int) of train samples	None (complement of test_size)
random_state	Seed for reproducible shuffling	None
shuffle	Whether to shuffle before splitting	True
stratify	Array of class labels used to keep class proportions the same in both subsets	None
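The stratify parameter appears in the table but not in the examples, so here is a minimal sketch of a stratified split. The imbalanced toy labels and the names train_counts and test_counts are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 16 samples of class 0 and 4 samples of class 1
X = np.arange(40).reshape((20, 2))
y = np.array([0] * 16 + [1] * 4)

# stratify=y asks for the same class proportions in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Count how many samples of each class landed in each subset
train_counts = np.bincount(y_train).tolist()
test_counts = np.bincount(y_test).tolist()

print(train_counts)  # [12, 3]
print(test_counts)   # [4, 1]
```

Both subsets keep the original 80/20 class ratio. Without stratify, a small test set could end up with no samples of the rare class at all.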

✅

Key Takeaways

Always split both features and labels together to keep data aligned.

Set random_state to get the same train-test split every time.

Use test_size to control how much data goes to testing.

train_test_split shuffles data by default; pass shuffle=False when the order of the samples matters, such as with time series.

Check shapes of output arrays to confirm correct splitting.