
How to Use StandardScaler in sklearn with Python

Use StandardScaler from sklearn.preprocessing to scale features by removing the mean and scaling to unit variance. First, create a scaler object, then fit it to your data with fit(), and transform your data with transform() or combine both with fit_transform().
📐

Syntax

The basic usage of StandardScaler involves creating an instance, fitting it to your data to learn the mean and standard deviation, and then transforming the data to scale it.

  • scaler = StandardScaler(): Creates the scaler object.
  • scaler.fit(X): Computes the mean and std of features in X.
  • scaler.transform(X): Scales X using the learned parameters.
  • scaler.fit_transform(X): Fits and transforms X in one step.
python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)  # Learn mean and std from data X
X_scaled = scaler.transform(X)  # Scale data

# Or combine both steps:
X_scaled = scaler.fit_transform(X)
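After fitting, the learned statistics are available as attributes on the scaler, which is useful for inspecting what transform() will actually do. A short sketch (the data here is made up for illustration):

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Illustrative data: 3 samples, 2 features
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X)

# The fitted scaler stores the per-feature statistics it learned:
print("Means:", scaler.mean_)    # per-feature mean subtracted by transform()
print("Stds:", scaler.scale_)    # per-feature std divided out by transform()
```

Note that scale_ is the population standard deviation (computed with ddof=0), not the sample standard deviation.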
💻

Example

This example shows how to scale a small dataset using StandardScaler. It fits the scaler to the data, transforms it, and prints the original and scaled data.

python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data: 3 samples, 2 features
X = np.array([[1, 2], [3, 4], [5, 6]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Original data:\n", X)
print("\nScaled data:\n", X_scaled)
Output
Original data:
 [[1 2]
 [3 4]
 [5 6]]

Scaled data:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
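You can confirm the scaling worked by checking the column statistics of the result. A quick sanity check on the same data:

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])
X_scaled = StandardScaler().fit_transform(X)

# Each column of the scaled data should have mean 0 and std 1
# (np.std defaults to ddof=0, matching StandardScaler's population std)
print("Column means:", X_scaled.mean(axis=0))
print("Column stds:", X_scaled.std(axis=0))
```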
⚠️

Common Pitfalls

Common mistakes when using StandardScaler include:

  • Fitting the scaler separately on training and test data, which causes data leakage. Always fit on training data only.
  • Not transforming test data with the same scaler fitted on training data.
  • Applying scaling before splitting data, which leaks information from test to train.

Correct usage fits the scaler on training data and transforms both training and test data with it.

python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Correct: fitting scaler on training data only
X_train = np.array([[1, 2], [3, 4]])
X_test = np.array([[5, 6]])

scaler = StandardScaler()
scaler.fit(X_train)  # Fit only on training data
X_train_scaled = scaler.transform(X_train)

# Do not fit again on test data (causes leakage)
# scaler.fit(X_test)  # Don't do this
X_test_scaled = scaler.transform(X_test)  # Use the same scaler

print("Train scaled:\n", X_train_scaled)
print("Test scaled:\n", X_test_scaled)
Output
Train scaled:
 [[-1. -1.]
 [ 1.  1.]]

Test scaled:
 [[3. 3.]]
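An easy way to avoid this class of mistake entirely is to put the scaler and the model in a Pipeline, which guarantees the scaler is fitted only on the data passed to fit(). A sketch using a synthetic dataset and LogisticRegression (both chosen here just for illustration):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data for illustration
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() scales and trains using X_train only; score() reuses the
# training-set statistics when transforming X_test, so no leakage.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))
```

This pattern is especially important with cross-validation, where refitting the scaler inside each fold by hand is error-prone.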
📊

Quick Reference

Remember these key points when using StandardScaler:

  • Use fit() on training data only.
  • Use transform() on any data to scale it.
  • fit_transform() is a shortcut for fitting and transforming training data.
  • Scaled data has mean 0 and standard deviation 1 per feature.
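One more handy method worth remembering: inverse_transform() maps scaled values back to the original units, which is useful for reporting predictions on the original scale. A minimal round-trip sketch:

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# inverse_transform undoes the scaling using the stored mean and std
X_restored = scaler.inverse_transform(X_scaled)
print(np.allclose(X, X_restored))  # → True
```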

Key Takeaways

  • Always fit StandardScaler on training data only to avoid data leakage.
  • Use fit_transform() to scale training data in one step.
  • Transform test or new data using the scaler fitted on training data.
  • StandardScaler centers data to mean 0 and scales to unit variance.
  • Scaling improves many machine learning models by normalizing feature ranges.