Feature Scaling in Machine Learning with Python using sklearn
sklearn.preprocessing provides tools like StandardScaler and MinMaxScaler to easily perform feature scaling, which helps models learn better and faster.
How It Works
Feature scaling adjusts the range of data features so they are on a similar scale. Imagine you have a dataset with height in centimeters and weight in kilograms. Height values might be around 150-200, while weight might be 50-100. Without scaling, the model might think height is more important just because its numbers are bigger.
Scaling methods like standardization subtract the mean and divide by the standard deviation, producing data centered around zero with a spread of one. Another method, normalization (min-max scaling), rescales data to a fixed range such as 0 to 1. This helps machine learning algorithms treat all features equally and speeds up learning.
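As a quick sketch of the normalization method described above, MinMaxScaler rescales each column independently so its minimum maps to 0 and its maximum maps to 1 (the height/weight values here are illustrative):

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Illustrative data: height (cm), weight (kg)
data = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0], [175.0, 75.0]])

# MinMaxScaler defaults to feature_range=(0, 1):
# scaled = (x - column_min) / (column_max - column_min)
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)

print(data_scaled)
# Heights 160-180 map to 0-1, weights 55-80 map to 0-1,
# so both columns now share the same range.
```

After this transformation, each column's minimum is exactly 0 and its maximum exactly 1, regardless of the original units.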
Example
This example shows how to use StandardScaler from sklearn.preprocessing to scale features in Python.
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data: height (cm), weight (kg)
data = np.array([[170, 65], [180, 80], [160, 55], [175, 75]])

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

print("Original data:\n", data)
print("\nScaled data:\n", data_scaled)
When to Use
Use feature scaling when your machine learning model depends on the distance or magnitude of features, such as in algorithms like k-nearest neighbors, support vector machines, and gradient descent-based models like linear regression or neural networks.
It is especially important when features have different units or scales, to prevent bias toward features with larger values. For example, in predicting house prices, features like area (square feet) and number of rooms should be scaled so the model treats them fairly.
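To illustrate the house-price point above, a common pattern is to chain the scaler and a distance-based model in a pipeline so both features contribute on the same scale. The data and labels here are hypothetical, made up just for the sketch:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Hypothetical housing data: [area (sq ft), number of rooms]
X = np.array([[1400, 3], [1600, 3], [2400, 4],
              [3000, 5], [900, 2], [1100, 2]], dtype=float)
y = np.array([0, 0, 1, 1, 0, 0])  # made-up labels, e.g. 1 = "expensive"

# Without scaling, distances would be dominated by area (values in the
# thousands) while rooms (single digits) would barely matter. The pipeline
# standardizes both features before k-NN measures distances.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)

prediction = model.predict(np.array([[2000.0, 4.0]]))
print(prediction)
```

Using a pipeline also guarantees the scaler is fit only on the data passed to fit, which matters once you evaluate on held-out data.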
Key Points
- Feature scaling makes data features comparable by adjusting their range.
- StandardScaler standardizes features to zero mean and unit variance.
- MinMaxScaler rescales features to a fixed range, usually 0 to 1.
- Scaling improves model training speed and accuracy for many algorithms.
- Always fit the scaler on training data and apply the same transformation to test data.
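The last point above is the pattern that prevents data leakage: statistics (mean and standard deviation) are learned from the training set only, then reused to transform the test set. A minimal sketch, with illustrative height/weight values:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# Illustrative data: height (cm), weight (kg)
X = np.array([[170, 65], [180, 80], [160, 55],
              [175, 75], [165, 60], [185, 90]], dtype=float)

X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics; do NOT refit

print("Training mean learned by scaler:", scaler.mean_)
```

Calling fit_transform on the test set instead would leak information about the test distribution into preprocessing and make evaluation overly optimistic.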