How to Use train_test_split in sklearn with Python
Use train_test_split from sklearn.model_selection to split your data arrays into random train and test subsets. Pass your features and labels, specify test_size or train_size, and get back four arrays in this order: training features, testing features, training labels, and testing labels.
Syntax
The train_test_split function splits arrays or matrices into random train and test subsets.
- X: Features data (input variables).
- y: Labels or target data.
- test_size: Fraction or number of samples for the test set.
- train_size: Fraction or number of samples for the train set (optional).
- random_state: Seed for random number generator to get reproducible splits.
- shuffle: Whether to shuffle data before splitting (default is True).
```python
from sklearn.model_selection import train_test_split

# X and y are your feature and label arrays, defined elsewhere
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True)
```
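To make the test_size and shuffle parameters more concrete, here is a minimal sketch on a made-up 20-sample array. The data and the variable-name suffixes (_f, _i, _o) are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 20 samples, 2 features each
X = np.arange(40).reshape((20, 2))
y = np.arange(20)

# Float test_size: a fraction of the samples (0.25 * 20 = 5 test samples)
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Integer test_size: an absolute number of test samples
X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X, y, test_size=4, random_state=42
)

# shuffle=False: order is preserved, so the test set is the tail of the data
X_train_o, X_test_o, y_train_o, y_test_o = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

print(len(X_test_f), len(X_test_i))  # 5 4
print(y_test_o.tolist())             # [15, 16, 17, 18, 19]
```

Turning shuffling off is usually only worth considering when sample order carries meaning, for example in time-ordered data.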
Example
This example shows how to split a simple dataset of features and labels into training and testing sets using train_test_split. It prints the shapes of the resulting arrays to confirm the split.
```python
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: 10 samples, 2 features each
X = np.arange(20).reshape((10, 2))
# Labels: 10 samples
y = np.arange(10)

# Split data: 30% test, 70% train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)
```
Output
X_train shape: (7, 2)
X_test shape: (3, 2)
y_train shape: (7,)
y_test shape: (3,)
Common Pitfalls
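- Not setting random_state produces a different split on every run, which makes results hard to reproduce.
- Splitting only the features, or splitting features and labels in separate unseeded calls, breaks the correspondence between each input row and its target. Pass X and y in the same call.
- Setting both test_size and train_size as fractions that sum to more than 1.0 raises a ValueError. Setting both to less than the full dataset silently leaves some samples out of both subsets.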
```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(10).reshape((5, 2))  # 5 samples, 2 features each
y = np.arange(5)                   # 5 labels

# Wrong: splitting only X leaves y unsplit, so labels no longer match the rows
X_train, X_test = train_test_split(X, test_size=0.4)

# Right: split X and y in one call so rows and labels stay aligned
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
```
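The reproducibility and size pitfalls can also be checked directly. The following is a small sketch under assumed toy data; the helper names same_split, aligned, and size_error are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X is [2*i, 2*i + 1] and its label is i
X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# The same random_state gives the same split on every call
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=0)
same_split = np.array_equal(y_test_a, y_test_b)
print(same_split)  # True

# Splitting X and y together keeps each row paired with its label
aligned = np.array_equal(X_test_a[:, 0] // 2, y_test_a)
print(aligned)  # True

# Fractions that sum to more than 1.0 are rejected with a ValueError
try:
    train_test_split(X, y, test_size=0.7, train_size=0.6)
    size_error = None
except ValueError as exc:
    size_error = str(exc)
print(size_error is not None)  # True
```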
Quick Reference
| Parameter | Description | Default |
|---|---|---|
| X | Input features array or matrix | Required |
| y | Target labels array, split with the same indices as X | Optional (any number of equal-length arrays can be passed) |
| test_size | Proportion or number of test samples | None (default 0.25 if train_size is None) |
| train_size | Proportion or number of train samples | None |
| random_state | Seed for reproducible shuffling | None |
| shuffle | Whether to shuffle before splitting | True |
| stratify | Class labels used to keep class proportions the same in the train and test sets | None |
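The stratify parameter in the table is easiest to understand on imbalanced labels. Here is a minimal sketch, assuming a made-up dataset with 15 samples of class 0 and 5 of class 1. Passing stratify=y asks for the same class ratio in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced data: 75% class 0, 25% class 1
X = np.arange(40).reshape((20, 2))
y = np.array([0] * 15 + [1] * 5)

# stratify=y preserves the 3:1 class ratio in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(np.bincount(y_train).tolist())  # [12, 4]
print(np.bincount(y_test).tolist())   # [3, 1]
```

Without stratify, a small test set drawn from imbalanced data could end up with very few or no samples of the minority class.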
Key Takeaways
- Always split features and labels in the same call to keep them aligned.
- Set random_state to get the same train-test split on every run.
- Use test_size (a fraction or an absolute count) to control how much data goes to testing.
- train_test_split shuffles the data by default, so the original sample order does not bias the split.
- Check the shapes of the output arrays to confirm the split is what you expected.