MLOps · How-To · Beginner · 4 min read

How to Use RobustScaler in sklearn with Python

Use RobustScaler from sklearn.preprocessing to scale features by removing the median and scaling according to the interquartile range. Fit the scaler on training data with fit() and transform data with transform() or both at once with fit_transform().

📐 Syntax

The RobustScaler class is imported from sklearn.preprocessing. You create an instance with optional parameters like with_centering and with_scaling. Use fit() to learn scaling parameters from data, transform() to apply scaling, or fit_transform() to do both in one step.

  • RobustScaler(): Creates the scaler object.
  • fit(X): Computes median and interquartile range from data X.
  • transform(X): Scales data X using learned parameters.
  • fit_transform(X): Fits and transforms data X in one call.
```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler(with_centering=True, with_scaling=True)
scaler.fit(X_train)  # Learn median and IQR from training data
X_train_scaled = scaler.transform(X_train)  # Scale training data
X_test_scaled = scaler.transform(X_test)  # Scale test data using the same parameters
```

💻 Example

This example shows how to use RobustScaler to scale a small dataset with outliers. It fits the scaler on training data, transforms both training and test data, and prints the results.

```python
from sklearn.preprocessing import RobustScaler
import numpy as np

# Sample data with outliers
X_train = np.array([[1, 2], [2, 3], [100, 200], [3, 4], [4, 5]])
X_test = np.array([[5, 6], [6, 7], [300, 400]])

# Create and fit scaler
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Scaled training data:")
print(X_train_scaled)
print("\nScaled test data:")
print(X_test_scaled)
```
Output
Scaled training data:
[[-1.  -1. ]
 [-0.5 -0.5]
 [48.5 98. ]
 [ 0.   0. ]
 [ 0.5  0.5]]

Scaled test data:
[[  1.    1. ]
 [  1.5   1.5]
 [148.5 198. ]]

⚠️ Common Pitfalls

Common mistakes when using RobustScaler include:

  • Fitting the scaler separately on training and test data, which causes data leakage and inconsistent scaling.
  • Not fitting the scaler before transforming data, leading to errors.
  • Using RobustScaler when the data has no outliers; StandardScaler is usually the better fit there.
```python
from sklearn.preprocessing import RobustScaler
import numpy as np

X_train = np.array([[1, 2], [2, 3], [3, 4]])
X_test = np.array([[4, 5], [5, 6]])

# Wrong: fitting a separate scaler on test data
scaler_train = RobustScaler()
X_train_scaled = scaler_train.fit_transform(X_train)

scaler_test = RobustScaler()
X_test_scaled_wrong = scaler_test.fit_transform(X_test)  # Wrong: fits on test data

# Right: fit on train, transform test
X_test_scaled_right = scaler_train.transform(X_test)

print("Wrong scaled test data:")
print(X_test_scaled_wrong)
print("\nRight scaled test data:")
print(X_test_scaled_right)
```
Output
Wrong scaled test data:
[[-1. -1.]
 [ 1.  1.]]

Right scaled test data:
[[2. 2.]
 [3. 3.]]
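The third pitfall above is worth seeing in numbers. As a minimal sketch (the data here is made up for illustration), a single extreme outlier inflates the mean and variance that StandardScaler relies on, squashing the inliers toward zero, while RobustScaler's median and IQR are barely affected:

```python
from sklearn.preprocessing import RobustScaler, StandardScaler
import numpy as np

# One extreme outlier dominates the mean and standard deviation
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler().fit_transform(X)
standard = StandardScaler().fit_transform(X)

# StandardScaler compresses the inliers because the outlier inflates
# the variance; RobustScaler keeps their spread intact.
print("Robust:  ", robust.ravel())
print("Standard:", standard.ravel())
```

With RobustScaler the inliers keep a spread of 0.5 per step (median 3, IQR 2), while under StandardScaler all four inliers collapse into a narrow band near zero.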

📊 Quick Reference

| Parameter | Description | Default |
| --- | --- | --- |
| with_centering | Whether to center data before scaling (subtract the median) | True |
| with_scaling | Whether to scale data to the IQR (interquartile range) | True |
| quantile_range | Range of quantiles (min, max) used to calculate the IQR | (25.0, 75.0) |
| copy | Whether to copy the data rather than scaling in place | True |
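The quantile_range parameter deserves a quick demonstration. As a sketch with made-up data, widening the range from the default (25.0, 75.0) to (10.0, 90.0) lets the outlier influence the scale, shrinking every scaled value:

```python
from sklearn.preprocessing import RobustScaler
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Default: scale by the 25th-75th percentile range (IQR = 2 here)
default_scaled = RobustScaler().fit_transform(X)

# Wider range: the 10th-90th percentile span includes more of the
# outlier's pull, so the divisor grows and scaled values shrink
wide_scaled = RobustScaler(quantile_range=(10.0, 90.0)).fit_transform(X)

print("Default:", default_scaled.ravel())
print("Wide:   ", wide_scaled.ravel())
```

Both versions still map the median (3.0) to exactly zero; only the divisor changes.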

Key Takeaways

  • Always fit RobustScaler on training data only, then transform test data to avoid data leakage.
  • RobustScaler scales data using the median and interquartile range, making it robust to outliers.
  • Use fit_transform() to fit and scale training data in one step for convenience.
  • Avoid fitting the scaler separately on test data to keep scaling consistent.
  • RobustScaler is best when your data contains outliers that affect the mean and variance.
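The second takeaway can be checked by hand. As a sketch, RobustScaler's default transform is (x − median) / IQR per feature, which you can reproduce with plain NumPy and compare against sklearn's result:

```python
from sklearn.preprocessing import RobustScaler
import numpy as np

X = np.array([[1.0], [2.0], [100.0], [3.0], [4.0]])

# sklearn's result
scaled = RobustScaler().fit_transform(X)

# Manual equivalent: subtract the median, divide by the IQR
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25.0, 75.0], axis=0)
manual = (X - median) / (q3 - q1)

print(np.allclose(scaled, manual))  # True
```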