How to Scale Features in Python Using sklearn
To scale features in Python, use sklearn.preprocessing scalers such as StandardScaler or MinMaxScaler. These tools transform your data so all features have similar ranges, which improves training and prediction for many models.

Syntax
Feature scaling in sklearn uses scaler classes with these steps:
- Import the scaler from sklearn.preprocessing.
- Create a scaler object (e.g., StandardScaler()).
- Fit the scaler to your data using fit() or fit_transform().
- Transform your data with transform(), or use fit_transform() to do both at once.
Example syntax:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
```
Example
This example shows how to scale a small dataset using StandardScaler and MinMaxScaler. It prints the original and scaled data to compare.
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

# Sample data with 3 features
data = np.array([[10, 200, 3000],
                 [20, 300, 4000],
                 [30, 400, 5000],
                 [40, 500, 6000]])

# StandardScaler: mean=0, std=1
scaler_std = StandardScaler()
scaled_std = scaler_std.fit_transform(data)

# MinMaxScaler: scales features to [0, 1]
scaler_minmax = MinMaxScaler()
scaled_minmax = scaler_minmax.fit_transform(data)

print("Original data:\n", data)
print("\nStandardScaler scaled data:\n", scaled_std)
print("\nMinMaxScaler scaled data:\n", scaled_minmax)
```
Output
```text
Original data:
 [[  10  200 3000]
  [  20  300 4000]
  [  30  400 5000]
  [  40  500 6000]]

StandardScaler scaled data:
 [[-1.34164079 -1.34164079 -1.34164079]
  [-0.4472136  -0.4472136  -0.4472136 ]
  [ 0.4472136   0.4472136   0.4472136 ]
  [ 1.34164079  1.34164079  1.34164079]]

MinMaxScaler scaled data:
 [[0.         0.         0.        ]
  [0.33333333 0.33333333 0.33333333]
  [0.66666667 0.66666667 0.66666667]
  [1.         1.         1.        ]]
```
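A fitted scaler stores the statistics it learned (for StandardScaler, the per-feature mean and scale), so you can undo a transform with inverse_transform() to recover the original values. This sketch uses the same sample data as above as a round-trip sanity check:

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[10, 200, 3000],
                 [20, 300, 4000],
                 [30, 400, 5000],
                 [40, 500, 6000]], dtype=float)

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# The fitted scaler remembers the per-feature statistics it learned
print("Learned means:", scaler.mean_)

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaled)
print("Round-trip matches original:", np.allclose(restored, data))
```

This is also how you convert model predictions made in scaled space back to the original units.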
Common Pitfalls
Common mistakes when scaling features include:
- Fitting a separate scaler on training and test data, which produces inconsistent scaling and hurts model performance.
- Fitting the scaler on the full dataset before splitting, which leaks test-set statistics into training.
- Scaling categorical features that should instead be encoded (e.g., one-hot encoded).
Always fit the scaler on training data, then transform both training and test data with the same scaler.
```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Training data
train = np.array([[1, 2], [3, 4], [5, 6]])
# Test data
test = np.array([[7, 8], [9, 10]])

# Wrong: fitting a separate scaler on train and test
scaler_wrong_train = StandardScaler()
train_scaled_wrong = scaler_wrong_train.fit_transform(train)
scaler_wrong_test = StandardScaler()
test_scaled_wrong = scaler_wrong_test.fit_transform(test)

# Right: fit on train, then transform both train and test
scaler_right = StandardScaler()
train_scaled_right = scaler_right.fit_transform(train)
test_scaled_right = scaler_right.transform(test)

print("Wrong train scaled:\n", train_scaled_wrong)
print("Wrong test scaled:\n", test_scaled_wrong)
print("Right train scaled:\n", train_scaled_right)
print("Right test scaled:\n", test_scaled_right)
Output
```text
Wrong train scaled:
 [[-1.22474487 -1.22474487]
  [ 0.          0.        ]
  [ 1.22474487  1.22474487]]
Wrong test scaled:
 [[-1. -1.]
  [ 1.  1.]]
Right train scaled:
 [[-1.22474487 -1.22474487]
  [ 0.          0.        ]
  [ 1.22474487  1.22474487]]
Right test scaled:
 [[2.44948974 2.44948974]
  [3.67423461 3.67423461]]
```

Note how the wrong approach makes the test data look centered like the training data, hiding the fact that the test values are far larger. The right approach keeps the test data on the scale learned from training.
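For the third pitfall, a common pattern is to apply different preprocessing to different columns: scale the numeric ones and one-hot encode the categorical ones. This sketch uses sklearn's ColumnTransformer with a small made-up mixed dataset (the column indices and data are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import numpy as np

# Hypothetical mixed data: column 0 is numeric, column 1 is categorical
X = np.array([[10.0, "red"],
              [20.0, "blue"],
              [30.0, "red"],
              [40.0, "green"]], dtype=object)

# Scale the numeric column; one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), [0]),
    ("cat", OneHotEncoder(), [1]),
])

X_prepared = preprocess.fit_transform(X)
# Result: 1 scaled numeric column + 3 one-hot columns (blue, green, red)
print(X_prepared.shape)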
Quick Reference
Summary tips for feature scaling in Python:
- StandardScaler: scales data to mean 0 and standard deviation 1.
- MinMaxScaler: scales data to a fixed range, usually 0 to 1.
- Always fit scaler on training data only.
- Apply the same fitted scaler to test or new data with transform().
- Do not scale categorical features directly.
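MinMaxScaler defaults to [0, 1], but its feature_range parameter accepts any target interval. A short sketch scaling a single feature to [-1, 1]:

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[10.0], [20.0], [30.0], [40.0]])

# feature_range picks the target interval; default is (0, 1)
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data)

# Smallest value maps to -1, largest to 1
print(scaled.ravel())
```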
Key Takeaways
- Use sklearn scalers like StandardScaler or MinMaxScaler to scale features for better model performance.
- Always fit the scaler on training data only, then transform test data with the same scaler.
- StandardScaler centers data to mean 0 and std 1; MinMaxScaler scales data to a fixed range like 0 to 1.
- Avoid scaling categorical features directly; encode them properly before scaling if needed.
- Scaling ensures features have similar ranges, helping many machine learning algorithms work better.