How to Split Data into Train and Test Sets in Python with sklearn

Use train_test_split from sklearn.model_selection to split your data into training and testing sets. It randomly divides arrays or matrices into subsets; you typically set the size of the test set with test_size and pass a seed via random_state so the split is reproducible.

Syntax

The train_test_split function splits arrays or matrices into random train and test subsets.

  • arrays: One or more arrays of the same length to split (e.g., features and labels).
  • test_size: Fraction (a float between 0.0 and 1.0) or absolute number (an int) of samples for the test set.
  • train_size: Fraction (float) or absolute number (int) of samples for the train set (optional).
  • random_state: Seed for random number generator to get reproducible splits.
  • shuffle: Whether to shuffle data before splitting (default is True).
python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
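
The parameters above can be exercised on a small array to see how they change the resulting sizes. The following is a minimal sketch, assuming a made-up 20-sample dataset; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data for illustration: 20 samples, 2 features each
X = np.arange(40).reshape((20, 2))
y = np.arange(20)

# Float test_size: interpreted as a fraction of the samples
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
frac_sizes = (len(X_tr), len(X_te))

# Int test_size: interpreted as an absolute number of test samples
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=4, random_state=42)
count_sizes = (len(X_tr), len(X_te))

# shuffle=False: no shuffling, the test set is taken from the end of the data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, shuffle=False)
ordered_test = y_te

print(frac_sizes)    # (15, 5)
print(count_sizes)   # (16, 4)
print(ordered_test)  # [15 16 17 18 19]
```

With shuffle=False the split simply cuts the data in order, which is why the test labels are the last five values.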

Example

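This example splits a simple dataset of features and labels with test_size=0.25, which requests a 75% / 25% split. Because the dataset has only 10 samples, scikit-learn rounds the test set up to 3 samples and leaves 7 for training. Setting random_state ensures the split is the same every time you run the code.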

python
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: 10 samples, 2 features each
X = np.arange(20).reshape((10, 2))
# Labels for each sample
y = np.arange(10)

# Request a 25% test set (3 of the 10 samples after rounding up)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

print('X_train:\n', X_train)
print('X_test:\n', X_test)
print('y_train:', y_train)
print('y_test:', y_test)
Output
X_train:
 [[ 8  9]
 [14 15]
 [ 0  1]
 [ 4  5]
 [16 17]
 [ 2  3]
 [12 13]]
X_test:
 [[10 11]
 [18 19]
 [ 6  7]]
y_train: [4 7 0 2 8 1 6]
y_test: [5 9 3]
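
To check the reproducibility claim rather than compare printed rows by eye, one option is to run the same split twice and compare the results. This is a small sketch using the same sample data; the names first, second, and same_split are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same sample data as in the example: 10 samples, 2 features each
X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# Two independent calls with the same seed
first = train_test_split(X, y, test_size=0.25, random_state=1)
second = train_test_split(X, y, test_size=0.25, random_state=1)

# Compare the four returned arrays pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))

X_train, X_test, y_train, y_test = first
sizes = (len(X_train), len(X_test))

print(same_split)  # True
print(sizes)       # (7, 3)
```

The sizes confirm the rounding: a 0.25 test fraction of 10 samples yields 3 test samples and 7 training samples.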

Common Pitfalls

Common mistakes when splitting data include:

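  • Not setting random_state, which gives a different split on each run and makes results hard to reproduce.
  • Using an invalid test_size value, such as a float outside the range 0.0 to 1.0 or a negative number. An int is read as an absolute number of samples, not a fraction.
  • Setting shuffle=False on data that is sorted (for example, by label), which can produce biased splits.
  • Splitting features and labels in separate train_test_split calls. Each call shuffles independently, so rows no longer line up with their labels.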
python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# Wrong: splitting features and labels separately (misaligned)
X_train_wrong, X_test_wrong = train_test_split(X, test_size=0.4)
y_train_wrong, y_test_wrong = train_test_split(y, test_size=0.4)

# Right: split together to keep alignment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print('Wrong X_train:', X_train_wrong)
print('Wrong y_train:', y_train_wrong)
print('Correct X_train:', X_train)
print('Correct y_train:', y_train)
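Output
Wrong X_train: [[4 5]
 [0 1]
 [2 3]]
Wrong y_train: [1 3 4]
Correct X_train: [[4 5]
 [0 1]
 [2 3]]
Correct y_train: [2 0 1]

The two "Wrong" lines change from run to run because those calls set no random_state. In the run shown, rows [4 5], [0 1], and [2 3] belong to labels 2, 0, and 1, but the separate call returned labels 1, 3, and 4. In the correct split, every row stays paired with its own label.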
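
Two of the pitfalls above can be checked in code. The sketch below assumes the same toy data, where row i is [2*i, 2*i + 1] and its label is i; is_aligned is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))  # row i is [2*i, 2*i + 1]
y = np.arange(5)                   # label i belongs to row i


def is_aligned(features, labels):
    """Hypothetical helper: True when every row is still paired with its own label."""
    return np.array_equal(features[:, 0] // 2, labels)


# Splitting X and y in one call keeps rows and labels together
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
joint_ok = is_aligned(X_train, y_train) and is_aligned(X_test, y_test)

# Invalid test_size values (float above 1.0, negative float) raise an error
errors = []
for bad_size in (1.5, -0.2):
    try:
        train_test_split(X, y, test_size=bad_size)
    except ValueError as exc:
        errors.append(type(exc).__name__)

print(joint_ok)     # True
print(len(errors))  # 2
```

Catching ValueError is a conservative choice here; the exact exception class and message may differ between scikit-learn versions.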

Quick Reference

Parameter     Description                              Default
arrays        Data arrays to split (features, labels)  Required
test_size     Proportion or number of test samples     0.25 if train_size not set
train_size    Proportion or number of train samples    None (computed)
random_state  Seed for reproducible splits             None
shuffle       Shuffle data before splitting            True
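
The size defaults in the table can be checked directly. This is a minimal sketch on made-up data with 20 samples; it shows the 0.25 default for test_size and how the test set becomes the complement when only train_size is given.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data for illustration: 20 samples, 2 features each
X = np.arange(40).reshape((20, 2))
y = np.arange(20)

# Neither test_size nor train_size given: the test set defaults to 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
default_sizes = (len(X_train), len(X_test))

# Only train_size given: the test set is whatever remains
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0
)
train_only_sizes = (len(X_train), len(X_test))

print(default_sizes)     # (15, 5)
print(train_only_sizes)  # (10, 10)
```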

Key Takeaways

  • Use sklearn's train_test_split to split features and labels together so they stay aligned.
  • Set random_state to get the same train-test split every time you run your code.
  • Specify test_size as a fraction (e.g., 0.25) to control test set size.
  • Avoid splitting features and labels separately to prevent misalignment.
  • Shuffle data before splitting unless you have a reason not to.