How to Use RobustScaler in sklearn with Python
Use RobustScaler from sklearn.preprocessing to scale features by removing the median and scaling according to the interquartile range (IQR). Fit the scaler on training data with fit() and transform data with transform(), or do both at once with fit_transform().
Syntax
The RobustScaler class is imported from sklearn.preprocessing. You create an instance with optional parameters such as with_centering and with_scaling. Use fit() to learn the scaling parameters from data, transform() to apply the scaling, or fit_transform() to do both in one step.
- RobustScaler(): Creates the scaler object.
- fit(X): Computes the median and interquartile range of X.
- transform(X): Scales X using the learned parameters.
- fit_transform(X): Fits and transforms X in one call.
```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler(with_centering=True, with_scaling=True)
scaler.fit(X_train)                         # Learn median and IQR from training data
X_train_scaled = scaler.transform(X_train)  # Scale training data
X_test_scaled = scaler.transform(X_test)    # Scale test data using the same parameters
```
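Under the hood, fit() stores the per-feature median in the scaler's center_ attribute and the IQR in scale_. The sketch below (using a small made-up array) verifies that transform() is equivalent to subtracting the median and dividing by the IQR computed with NumPy:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Small illustrative dataset (two features)
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# Recompute the statistics that fit() learned
median = np.median(X, axis=0)
iqr = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)

print(np.allclose(scaler.center_, median))        # True
print(np.allclose(scaler.scale_, iqr))            # True
print(np.allclose(X_scaled, (X - median) / iqr))  # True
```

This makes the behavior easy to reason about: a value equal to the median maps to 0, and a value one IQR above the median maps to 1.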
Example
This example shows how to use RobustScaler to scale a small dataset with outliers. It fits the scaler on training data, transforms both training and test data, and prints the results.
```python
from sklearn.preprocessing import RobustScaler
import numpy as np

# Sample data with outliers
X_train = np.array([[1, 2], [2, 3], [100, 200], [3, 4], [4, 5]])
X_test = np.array([[5, 6], [6, 7], [300, 400]])

# Create and fit scaler
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Scaled training data:")
print(X_train_scaled)
print("\nScaled test data:")
print(X_test_scaled)
```
Output
Scaled training data:
[[-1.  -1. ]
 [-0.5 -0.5]
 [48.5 98. ]
 [ 0.   0. ]
 [ 0.5  0.5]]

Scaled test data:
[[  1.    1. ]
 [  1.5   1.5]
 [148.5 198. ]]
Common Pitfalls
Common mistakes when using RobustScaler include:
- Fitting the scaler separately on training and test data, which causes data leakage and inconsistent scaling.
- Not fitting the scaler before transforming data, leading to errors.
- Using RobustScaler when the data does not have outliers, where StandardScaler might be better.
```python
from sklearn.preprocessing import RobustScaler
import numpy as np

X_train = np.array([[1, 2], [2, 3], [3, 4]])
X_test = np.array([[4, 5], [5, 6]])

# Wrong: fitting a second scaler on the test data
scaler_train = RobustScaler()
X_train_scaled = scaler_train.fit_transform(X_train)
scaler_test = RobustScaler()
X_test_scaled_wrong = scaler_test.fit_transform(X_test)  # Wrong: fits on test data

# Right: fit on train, transform test
X_test_scaled_right = scaler_train.transform(X_test)

print("Wrong scaled test data:")
print(X_test_scaled_wrong)
print("\nRight scaled test data:")
print(X_test_scaled_right)
```
Output
Wrong scaled test data:
[[-1. -1.]
 [ 1.  1.]]

Right scaled test data:
[[2. 2.]
 [3. 3.]]
Quick Reference
| Parameter | Description | Default |
|---|---|---|
| with_centering | Whether to center data before scaling (subtract median) | True |
| with_scaling | Whether to scale data to IQR (interquartile range) | True |
| quantile_range | Range of quantiles used to calculate IQR (min, max) | (25.0, 75.0) |
| copy | If True, copy the input data; if False, try to scale in place | True |
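The quantile_range parameter from the table above controls which quantiles define the spread. Widening the range pulls more of the tails into the computed scale_, as this illustrative sketch with made-up one-feature data shows:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with a single large outlier
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [100.0]])

default = RobustScaler()                          # quantile_range=(25.0, 75.0)
wide = RobustScaler(quantile_range=(10.0, 90.0))  # wider quantile window

default.fit(X)
wide.fit(X)

print(default.scale_)  # spread between the 25th and 75th percentiles: [2.5]
print(wide.scale_)     # spread between the 10th and 90th percentiles: [51.5]
```

The wider window reaches into the outlier's tail, so the resulting scale is much larger; the default (25.0, 75.0) is usually a sensible starting point.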
Key Takeaways
- Always fit RobustScaler on training data only, then transform test data to avoid data leakage.
- RobustScaler scales data using the median and interquartile range, making it robust to outliers.
- Use fit_transform() to fit and scale training data in one step for convenience.
- Avoid fitting the scaler separately on test data to keep scaling consistent.
- RobustScaler is best when your data contains outliers that distort the mean and variance.