How to Use PCA for Dimensionality Reduction in Python with sklearn
Use sklearn.decomposition.PCA to reduce the dimensionality of your data by fitting it to your data and transforming it. Initialize PCA with the number of components you want to keep, then call fit_transform() on your dataset to get the reduced features.

Syntax
The basic syntax to use PCA in Python with sklearn is:
- PCA(n_components): Create a PCA object specifying how many dimensions to keep.
- fit_transform(data): Fit the PCA model to your data and reduce its dimensions in one step.
- explained_variance_ratio_: Attribute showing how much variance each principal component explains.
```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)               # keep 2 principal components
reduced_data = pca.fit_transform(data)  # fit and reduce data dimensions
print(pca.explained_variance_ratio_)    # variance explained by each component
```
Example
This example shows how to reduce a 4-dimensional dataset to 2 dimensions using PCA and prints the reduced data and variance explained.
```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load sample data
iris = load_iris()
data = iris.data

# Initialize PCA to reduce to 2 components
pca = PCA(n_components=2)

# Fit PCA and transform data
reduced_data = pca.fit_transform(data)

# Print reduced data shape and first 5 rows
print('Reduced data shape:', reduced_data.shape)
print('First 5 rows of reduced data:\n', reduced_data[:5])

# Print variance explained by each component
print('Explained variance ratio:', pca.explained_variance_ratio_)
```
Output
Reduced data shape: (150, 2)
First 5 rows of reduced data:
[[-2.68412563 0.31939725]
[-2.71414169 -0.17700123]
[-2.88899057 -0.14494943]
[-2.74534286 -0.31829898]
[-2.72871654 0.32675451]]
Explained variance ratio: [0.92461872 0.05306648]
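Rather than hard-coding the number of components, n_components also accepts a float between 0 and 1; sklearn then keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on the same iris data:

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

data = load_iris().data

# Ask for enough components to explain at least 95% of the variance
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(data)

print('Components kept:', pca.n_components_)
print('Total variance explained:', pca.explained_variance_ratio_.sum())
```

On this (unscaled) data the first component alone explains about 92% of the variance, so two components are kept to clear the 95% threshold.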
Common Pitfalls
Common mistakes when using PCA include:
- Not scaling data before PCA, which can cause features with larger scales to dominate.
- Choosing too many or too few components without checking explained variance.
- Using PCA on categorical data without encoding.
Always scale your data (e.g., with StandardScaler) before applying PCA for best results.
```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
data = iris.data

# Wrong: PCA without scaling
pca_wrong = PCA(n_components=2)
reduced_wrong = pca_wrong.fit_transform(data)
print('Explained variance without scaling:', pca_wrong.explained_variance_ratio_)

# Right: scale data first
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
pca_right = PCA(n_components=2)
reduced_right = pca_right.fit_transform(data_scaled)
print('Explained variance with scaling:', pca_right.explained_variance_ratio_)
```
Output
Explained variance without scaling: [0.92461872 0.05306648]
Explained variance with scaling: [0.72962445 0.22850762]
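One way to make the scaling step impossible to forget is to chain StandardScaler and PCA in an sklearn pipeline, so both run in order on every fit. A minimal sketch on the same iris data:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

data = load_iris().data

# Scaling and PCA are applied in sequence whenever the pipeline is fitted
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
reduced = pipeline.fit_transform(data)

print('Reduced shape:', reduced.shape)
print('Explained variance:', pipeline.named_steps['pca'].explained_variance_ratio_)
```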
Quick Reference
Remember these tips when using PCA:
- Use n_components to set how many dimensions to keep.
- Always scale your data before PCA.
- Check explained_variance_ratio_ to understand how much information is retained.
- Use fit_transform() to apply PCA in one step.
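A related point worth remembering: fit_transform() belongs on training data only. On held-out data, call transform() so the test set is projected with the components (and scaling statistics) learned from the training set. A minimal sketch:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

data = load_iris().data
X_train, X_test = train_test_split(data, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit scaler on training data only
X_test_scaled = scaler.transform(X_test)         # reuse training statistics

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)  # learn components on training data
X_test_pca = pca.transform(X_test_scaled)        # project test data with the same components

print('Train shape:', X_train_pca.shape)
print('Test shape:', X_test_pca.shape)
```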
Key Takeaways
- Initialize PCA with the number of components to keep using PCA(n_components).
- Always scale your data before applying PCA to get meaningful results.
- Use fit_transform() to reduce data dimensions in one step.
- Check explained_variance_ratio_ to see how much variance each component captures.
- PCA works best on numerical data and reduces redundancy by combining correlated features.
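Because each principal component is a linear combination of the original features, the reduced representation can also be mapped back to the original feature space with inverse_transform(); the reconstruction error shows how much information the dropped components carried. A minimal sketch on the iris data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

data = load_iris().data

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

# Map the 2-D representation back to the original 4-D feature space
reconstructed = pca.inverse_transform(reduced)

# Mean squared reconstruction error: the information lost by dropping components
error = np.mean((data - reconstructed) ** 2)
print('Reconstructed shape:', reconstructed.shape)
print('Reconstruction error:', error)
```

Since the first two components already explain about 98% of the variance here, the error is small.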