MLOps · How-To · Beginner · 3 min read

How to Standardize Data in Python Using sklearn

To standardize data in Python, use StandardScaler from sklearn.preprocessing. It rescales each feature to zero mean and unit variance, which helps models that are sensitive to feature scale, such as linear models, SVMs, and k-nearest neighbors, train and perform better.
📐

Syntax

The main steps to standardize data using StandardScaler are:

  • from sklearn.preprocessing import StandardScaler: Import the scaler.
  • scaler = StandardScaler(): Create a scaler object.
  • scaler.fit(data): Calculate mean and standard deviation from the data.
  • scaled_data = scaler.transform(data): Apply scaling to data.
  • Or combine fit and transform with scaled_data = scaler.fit_transform(data).
python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(data)  # Learn mean and std from data
scaled_data = scaler.transform(data)  # Apply scaling

# Or do both in one step
scaled_data = scaler.fit_transform(data)
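After fitting, the learned statistics are stored on the scaler object as mean_ and scale_, and inverse_transform undoes the scaling. A quick sketch (the small data array here is just for illustration):

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Illustrative data: 3 samples, 2 features
data = np.array([[10.0, 20.0],
                 [15.0, 24.0],
                 [14.0, 22.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# Fitted statistics are exposed as attributes
print("Per-feature mean:", scaler.mean_)   # learned column means
print("Per-feature std:", scaler.scale_)   # learned column standard deviations

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaled)
print("Round-trip matches original:", np.allclose(restored, data))  # True
```

This round trip is handy for reporting model outputs back in the original units of the data.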
💻

Example

This example shows how to standardize a small dataset with two features. The output is the scaled data with mean 0 and variance 1 for each feature.

python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data: 3 samples, 2 features
data = np.array([[10, 20],
                 [15, 24],
                 [14, 22]])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print("Original data:\n", data)
print("\nStandardized data:\n", scaled_data)

# Check mean and std of scaled data
print("\nMean of each feature after scaling:", scaled_data.mean(axis=0))
print("Std of each feature after scaling:", scaled_data.std(axis=0))
Output
Original data:
 [[10 20]
 [15 24]
 [14 22]]

Standardized data:
 [[-1.38873015 -1.22474487]
 [ 0.9258201   1.22474487]
 [ 0.46291005  0.        ]]

Mean of each feature after scaling: [0. 0.]
Std of each feature after scaling: [1. 1.]
⚠️

Common Pitfalls

Common mistakes when standardizing data include:

  • Fitting the scaler on the entire dataset including test data, which causes data leakage.
  • Not applying the same scaler to both training and test data.
  • Using standardization on data that is not numeric or contains missing values without preprocessing.

Always fit the scaler only on training data, then transform both training and test data with the same scaler.

python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Wrong way: fitting scaler on all data (train + test)
all_data = np.array([[10, 20], [15, 24], [14, 22], [30, 40]])
scaler_wrong = StandardScaler()
scaler_wrong.fit(all_data)  # Includes test data

train_data = all_data[:3]
test_data = all_data[3:]
train_scaled_wrong = scaler_wrong.transform(train_data)

# Right way: fit only on train, transform train and test
scaler_right = StandardScaler()
scaler_right.fit(train_data)
train_scaled_right = scaler_right.transform(train_data)
test_scaled_right = scaler_right.transform(test_data)

print("Train scaled wrong:\n", train_scaled_wrong)
print("Test scaled right:\n", test_scaled_right)
Output
Train scaled wrong:
 [[-0.95454663 -0.82055271]
 [-0.29623861 -0.3155972 ]
 [-0.42790021 -0.56807495]]
Test scaled right:
 [[ 7.86947085 11.02270385]]
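The third pitfall above is scaling data that still contains missing values: StandardScaler will propagate NaNs into the output. One common fix is to impute before scaling, for example with sklearn's SimpleImputer. A minimal sketch (the array with the NaN is made up for illustration):

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import numpy as np

# Illustrative data with one missing value
data = np.array([[10.0, 20.0],
                 [np.nan, 24.0],
                 [14.0, 22.0]])

# Fill missing entries with the column mean, then standardize
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(data)

scaler = StandardScaler()
scaled = scaler.fit_transform(filled)

print("No NaNs remain:", not np.isnan(scaled).any())  # True
```

In a train/test setting the same rule applies to the imputer as to the scaler: fit it on training data only, then transform both sets.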
📊

Quick Reference

Tips for standardizing data with sklearn:

  • Use StandardScaler for zero mean and unit variance scaling.
  • Fit scaler only on training data to avoid data leakage.
  • Apply transform on test data using the same scaler.
  • Check data is numeric and clean before scaling.
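These tips can be bundled into a single estimator with sklearn's Pipeline, which refits the scaler on the training data of each fit (and of each cross-validation fold), so the fit-on-train-only rule is enforced automatically. A sketch, where the tiny dataset and the choice of LogisticRegression are illustrative:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

# Illustrative toy data: 8 samples, 2 features, binary labels
X = np.array([[10, 20], [15, 24], [14, 22], [30, 40],
              [12, 21], [28, 39], [11, 20], [29, 41]])
y = np.array([0, 0, 0, 1, 0, 1, 0, 1])

# Scaling and modeling happen together; the scaler only ever sees fit() data
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

print(model.predict([[13, 22], [31, 42]]))
```

Passing the pipeline to cross_val_score or GridSearchCV keeps the scaler leak-free without any manual bookkeeping.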

Key Takeaways

  • Use sklearn's StandardScaler to standardize data to zero mean and unit variance.
  • Always fit the scaler only on training data to prevent data leakage.
  • Apply the same scaler to transform both training and test datasets.
  • Standardization helps models that are sensitive to feature scale, such as linear models, SVMs, and k-nearest neighbors.
  • Ensure data is numeric and free of missing values before applying standardization.