
How to Use train_test_split in sklearn with Python

Use train_test_split from sklearn.model_selection to split your data arrays into random train and test subsets. Pass your features and labels, specify test_size or train_size, and get back four arrays: training features, testing features, training labels, and testing labels.
📐

Syntax

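train_test_split takes one or more arrays of equal length (typically features X and labels y) and splits each of them into a train part and a test part using the same random indices. The main parameters are:

  • X: Feature data (input variables), one row per sample.
  • y: Labels or target data, with the same number of samples as X.
  • test_size: A float between 0.0 and 1.0 is the fraction of samples for the test set; an int is the absolute number of test samples.
  • train_size: The same for the train set (optional). If omitted, it is set to the complement of test_size.
  • random_state: Seed for the random number generator; pass an int to get reproducible splits.
  • shuffle: Whether to shuffle data before splitting (default is True).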
python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True)
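
To make the difference between a fractional and an absolute test_size concrete, here is a minimal sketch on toy data. The arrays and variable names (y_te_frac, y_te_abs, y_te_ordered) are illustrative assumptions, not part of the original example. The last call also shows what shuffle=False does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 20 samples, 2 features each
X = np.arange(40).reshape((20, 2))
y = np.arange(20)

# Float test_size: a fraction of the data (0.25 of 20 samples -> 5 test samples)
_, _, _, y_te_frac = train_test_split(X, y, test_size=0.25, random_state=42)

# Integer test_size: an absolute number of test samples
_, _, _, y_te_abs = train_test_split(X, y, test_size=4, random_state=42)

# shuffle=False: no shuffling, so the test set is simply the tail of the data
_, _, _, y_te_ordered = train_test_split(X, y, test_size=4, shuffle=False)

print(len(y_te_frac))          # 5
print(len(y_te_abs))           # 4
print(y_te_ordered.tolist())   # [16, 17, 18, 19]
```

Keeping the original order with shuffle=False can be useful when the rows are ordered in time and the test set should contain the most recent samples.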
💻

Example

This example shows how to split a simple dataset of features and labels into training and testing sets using train_test_split. It prints the shapes of the resulting arrays to confirm the split.

python
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: 10 samples, 2 features each
X = np.arange(20).reshape((10, 2))
# Labels: 10 samples
y = np.arange(10)

# Split data: 30% test, 70% train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)
Output
X_train shape: (7, 2)
X_test shape: (3, 2)
y_train shape: (7,)
y_test shape: (3,)
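
Matching shapes do not prove that each feature row still sits next to its own label. The sketch below is one possible way to check that on the same toy data: because row i of X starts with 2*i and its label is i, the first feature column must equal twice the label after the split. The names train_aligned and test_aligned are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))  # row i is [2*i, 2*i + 1]
y = np.arange(10)                   # label i belongs to row i

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# If rows and labels were shuffled with the same indices,
# the first column of each row is exactly 2 * its label
train_aligned = np.array_equal(X_train[:, 0], 2 * y_train)
test_aligned = np.array_equal(X_test[:, 0], 2 * y_test)

print(train_aligned, test_aligned)  # True True
```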
⚠️

Common Pitfalls

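Not setting random_state produces a different split on each run, making results hard to reproduce.
Splitting only the features, or splitting features and labels in separate calls, breaks the pairing between inputs and targets. Pass both arrays in one call so they are shuffled with the same indices.
If test_size and train_size are both given as floats, their sum must not exceed 1.0; otherwise train_test_split raises a ValueError. If both are left as None, test_size defaults to 0.25.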

python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# Wrong: only X is split, so y still holds all 5 labels and no longer lines up with X_train or X_test
X_train, X_test = train_test_split(X, test_size=0.4)

# Right: split both X and y together
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
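
The other two pitfalls can be sketched in the same way. This is a small illustration on assumed toy data, with hypothetical variable names (same_split, raised): the first part shows that a fixed random_state gives the same split on repeated calls, the second that float sizes summing to more than 1.0 are rejected.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# Same random_state -> identical split on every call
_, _, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=0)
_, _, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=0)
same_split = np.array_equal(y_test_a, y_test_b)
print(same_split)  # True

# train_size + test_size > 1.0 (as fractions) is rejected with a ValueError
raised = False
try:
    train_test_split(X, y, train_size=0.8, test_size=0.5)
except ValueError as err:
    raised = True
    print("ValueError:", err)
```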
📊

Quick Reference

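| Parameter | Description | Default |
|---|---|---|
| X | Input features array or matrix | Required |
| y | Target labels array | Optional, but pass it together with X to keep labels aligned |
| test_size | Proportion (float) or number (int) of test samples | None (0.25 if train_size is also None) |
| train_size | Proportion (float) or number (int) of train samples | None (complement of test_size) |
| random_state | Seed for reproducible shuffling | None |
| shuffle | Whether to shuffle before splitting | True |
| stratify | Array of class labels used for stratified splitting | None |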
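
The stratify parameter is listed above without an example, so here is a minimal sketch of how it might be used. The imbalanced labels are made up for illustration: passing stratify=y asks train_test_split to keep the class proportions of y roughly the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumed for illustration): 15 of class 0, 5 of class 1
X = np.arange(40).reshape((20, 2))
y = np.array([0] * 15 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Both subsets keep the 3:1 class ratio of the full data
print(np.bincount(y_train).tolist())  # [12, 4]
print(np.bincount(y_test).tolist())   # [3, 1]
```

Without stratify, a small test set drawn from imbalanced data can end up with very few samples of the minority class, or none at all.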

Key Takeaways

Always split both features and labels together to keep data aligned.
Set random_state to get the same train-test split every time.
Use test_size to control how much data goes to testing.
train_test_split shuffles data by default; pass shuffle=False to keep the original sample order.
Check shapes of output arrays to confirm correct splitting.