How to Split Data into Train and Test Sets in Python with sklearn
Use train_test_split from sklearn.model_selection to split your data into training and testing sets. It randomly divides arrays or matrices into subsets, typically specifying the test size with test_size and a random seed with random_state for reproducibility.
Syntax
The train_test_split function splits arrays or matrices into random train and test subsets.
- arrays: The data arrays to split (e.g., features and labels).
- test_size: Fraction or number of samples for the test set.
- train_size: Fraction or number of samples for the train set (optional).
- random_state: Seed for the random number generator to get reproducible splits.
- shuffle: Whether to shuffle data before splitting (default is True).
```python
from sklearn.model_selection import train_test_split

# X and y are your feature and label arrays
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```
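As a minimal sketch of how test_size and train_size are interpreted, the snippet below passes an integer count in one call and a fractional train_size in another. The toy arrays and the variable names (X_tr_a, X_te_a, and so on) are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 10 samples, 2 features each
X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# An integer test_size is read as an absolute number of test samples
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X, y, test_size=3, random_state=42
)
print(len(X_tr_a), len(X_te_a))  # 7 3

# With only train_size given, the test set receives the remaining samples
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X, y, train_size=0.5, random_state=42
)
print(len(X_tr_b), len(X_te_b))  # 5 5
```

Whichever form you pick, the returned arrays come back in the order train, test for each input array.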
Example
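This example splits a simple dataset of features and labels with test_size=0.25. Because 10 samples cannot be divided exactly 75/25, scikit-learn rounds the test set up to 3 samples and leaves 7 for training. Setting random_state ensures the split is the same every time you run it.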
```python
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: 10 samples, 2 features each
X = np.arange(20).reshape((10, 2))
# Labels for each sample
y = np.arange(10)

# Split data: 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

print('X_train:\n', X_train)
print('X_test:\n', X_test)
print('y_train:', y_train)
print('y_test:', y_test)
```
Output
X_train:
[[ 8 9]
[14 15]
[ 0 1]
[ 4 5]
[16 17]
[ 2 3]
[12 13]]
X_test:
[[10 11]
[18 19]
[ 6 7]]
y_train: [4 7 0 2 8 1 6]
y_test: [5 9 3]
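The reproducibility claim can be checked directly. The sketch below, which assumes the same toy data as the example, runs the split twice with the same random_state and compares the results; the names first, second, and same_split are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# Two independent calls with the same seed
first = train_test_split(X, y, test_size=0.25, random_state=1)
second = train_test_split(X, y, test_size=0.25, random_state=1)

# Compare X_train, X_test, y_train, y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True

# 10 samples at test_size=0.25: the test count is rounded up to 3
train_count, test_count = len(first[0]), len(first[1])
print(train_count, test_count)  # 7 3
```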
Common Pitfalls
Common mistakes when splitting data include:
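- Not setting random_state, which makes the split different on each run and hard to reproduce.
- Passing an invalid test_size value, such as a negative number or a float of 1.0 or more. A float must lie strictly between 0 and 1; an integer is read as a sample count.
- Turning off shuffling (shuffle=False) on data stored in a meaningful order, for example sorted by label, which can cause biased splits.
- Splitting features and labels in separate train_test_split calls instead of one call, which can leave rows and labels misaligned.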
```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# Wrong: splitting features and labels separately (misaligned)
X_train_wrong, X_test_wrong = train_test_split(X, test_size=0.4)
y_train_wrong, y_test_wrong = train_test_split(y, test_size=0.4)

# Right: split together to keep alignment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print('Wrong X_train:', X_train_wrong)
print('Wrong y_train:', y_train_wrong)
print('Correct X_train:', X_train)
print('Correct y_train:', y_train)
```
Output
Wrong X_train: [[4 5]
[0 1]
[2 3]]
Wrong y_train: [1 3 4]
Correct X_train: [[4 5]
[0 1]
[2 3]]
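Correct y_train: [2 0 1]
In the wrong split, row [4 5] (sample 2) is paired with label 1. In the correct split, every label matches its row. The wrong values change from run to run because those calls set no random_state.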
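Alignment can also be verified in code rather than by eye. This is a minimal sketch that relies on a property of this toy data only: row i of X is [2*i, 2*i + 1], so the first value of a row, integer-divided by 2, recovers its label. The names aligned_train and aligned_test are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))  # row i is [2*i, 2*i + 1]
y = np.arange(5)                   # label i belongs to row i

# One call keeps features and labels paired
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Recover each row's original index and compare it with its label
aligned_train = np.array_equal(X_train[:, 0] // 2, y_train)
aligned_test = np.array_equal(X_test[:, 0] // 2, y_test)
print(aligned_train, aligned_test)  # True True
```

On real data there is usually no such shortcut, which is one more reason to pass features and labels to a single call.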
Quick Reference
| Parameter | Description | Default |
|---|---|---|
| arrays | Data arrays to split (features, labels) | Required |
| test_size | Proportion or number of test samples | 0.25 if train_size not set |
| train_size | Proportion or number of train samples | None (computed) |
| random_state | Seed for reproducible splits | None |
| shuffle | Shuffle data before splitting | True |
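The test_size and shuffle rows of the table can be probed with a short experiment. The sketch below assumes toy labels sorted by class; it shows an out-of-range float being rejected and what shuffle=False does to ordered data. The names rejected, y_train_tail, and y_test_tail are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))
y = np.array([0] * 5 + [1] * 5)  # labels sorted by class (assumed toy data)

# A float test_size outside (0, 1) is rejected with a ValueError
try:
    train_test_split(X, y, test_size=1.5)
    rejected = False
except ValueError as exc:
    rejected = True
    print("Rejected:", exc)

# shuffle=False keeps the original order: the test set is the tail of the data
_, _, y_train_tail, y_test_tail = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(y_train_tail)  # [0 0 0 0 0 1 1 1]
print(y_test_tail)   # [1 1]
```

Here the unshuffled test set contains only class 1, which is the kind of biased split the pitfalls list warns about.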
Key Takeaways
Use sklearn's train_test_split to split features and labels together for aligned data.
Set random_state to get the same train-test split every time you run your code.
Specify test_size as a fraction (e.g., 0.25) to control test set size.
Avoid splitting features and labels separately to prevent misalignment.
Keep shuffling enabled (the default) unless the sample order must be preserved, as with time-ordered data.