How to Use PCA in sklearn with Python: Simple Guide
Use `sklearn.decomposition.PCA` to create a PCA object, then call `fit` or `fit_transform` on your data to reduce its dimensions. You can specify the number of components with `n_components` to control how many features to keep.

Syntax
The basic syntax to use PCA in sklearn is:
- `PCA(n_components)`: Create a PCA object specifying how many components to keep.
- `fit(X)`: Learn the principal components from data `X`.
- `transform(X)`: Apply the dimensionality reduction to `X`.
- `fit_transform(X)`: Combine `fit` and `transform` in one step.
```python
from sklearn.decomposition import PCA

# X is assumed to be an array-like of shape (n_samples, n_features)
pca = PCA(n_components=2)  # keep 2 principal components
pca.fit(X)  # learn components from data X
X_reduced = pca.transform(X)  # reduce dimensions of X

# Or simply:
X_reduced = pca.fit_transform(X)
```
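To illustrate the equivalence of the two call styles, here is a minimal sketch on hypothetical random data. The array `X`, the seed, and the `svd_solver="full"` setting are assumptions made for this example. It checks that calling `fit` followed by `transform` gives the same projection as `fit_transform`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 100 samples, 5 features (seeded for repeatability)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Two-step version: learn the components, then project the data
pca_two_step = PCA(n_components=2, svd_solver="full")
pca_two_step.fit(X)
X_two_step = pca_two_step.transform(X)

# One-step version: learn and project in a single call
X_one_step = PCA(n_components=2, svd_solver="full").fit_transform(X)

print(X_two_step.shape)
print(np.allclose(X_two_step, X_one_step))
```

The two-step form is mainly useful when the fitted object will later be applied to other data with `transform`.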
Example
This example shows how to reduce a 4-feature dataset to 2 principal components using PCA from sklearn.
```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load sample data
data = load_iris()
X = data.data  # 150 samples, 4 features

# Create PCA object to keep 2 components
pca = PCA(n_components=2)

# Fit PCA and transform data
X_pca = pca.fit_transform(X)

# Print shape before and after
print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_pca.shape}")

# Print explained variance ratio
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
```
Output
Original shape: (150, 4)
Reduced shape: (150, 2)
Explained variance ratio: [0.92461872 0.05306648]
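The two ratios in the output add up to roughly 0.978, meaning the first two components retain most of the variance in this dataset. One way to see how that total grows with each extra component is to fit PCA with all components and take a cumulative sum. The following is a sketch; the variable names `pca_full` and `cumulative` are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Fit with all components kept (n_components=None is the default)
pca_full = PCA().fit(X)

# Running total of the variance fraction retained by the first k components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for k, total in enumerate(cumulative, start=1):
    print(f"{k} component(s): {total:.3f} of variance retained")
```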
Common Pitfalls
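- Not scaling data: PCA looks for directions of maximum variance, so features with larger numeric ranges can dominate the components. Use `StandardScaler` before PCA when features are on different scales.
- Choosing the wrong `n_components`: Too few components lose important information; too many keep noise.
- Confusing `fit` and `fit_transform`: `fit` only learns the components and returns the fitted PCA object, while `fit_transform` also returns the reduced data. Use `fit_transform` to reduce data in one step.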
```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # sample data so the snippet runs on its own

# Pitfall: PCA without scaling lets large-valued features dominate
pca = PCA(n_components=2)
pca.fit(X)  # X not scaled

# Better: scale first, then apply PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
```
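When the reduction has to be reused on data that was not seen during fitting, the scaler and the PCA object both need to be fitted once and then applied with `transform`. A possible way to keep the two steps together is scikit-learn's `make_pipeline`. This is a sketch; the train/test split, the `random_state`, and the name `reducer` are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Hold out 20% of the rows to stand in for unseen data
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Scaling and PCA are fitted together on the training rows only
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
X_train_pca = reducer.fit_transform(X_train)

# The held-out rows reuse the learned scaling and components
X_test_pca = reducer.transform(X_test)

print(X_train_pca.shape, X_test_pca.shape)
```

Calling `fit_transform` again on the held-out rows would learn a different projection, which is usually not what is wanted.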
Quick Reference
| Name | Description |
|---|---|
| n_components | Number of principal components to keep (int), or the fraction of variance to retain (float between 0 and 1) |
| fit(X) | Compute principal components from data X |
| transform(X) | Apply dimensionality reduction to X |
| fit_transform(X) | Fit PCA and transform X in one step |
| explained_variance_ratio_ | Fraction of total variance explained by each component |
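As a sketch of the float form of `n_components`, the snippet below asks PCA to keep enough components to retain at least 95% of the variance and then reads the chosen count from the fitted `n_components_` attribute. The 0.95 threshold and the use of the scaled iris data are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale the sample data so every feature contributes comparably
X_scaled = StandardScaler().fit_transform(load_iris().data)

# A float between 0 and 1 is treated as a target fraction of variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(f"Components kept: {pca.n_components_}")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.3f}")
```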
Key Takeaways
- Create a PCA object and set `n_components` to the number of components you want to keep.
- Use `fit_transform` to both learn and apply PCA to your data in one step.
- Scale your data before PCA when features are on different scales, so that no single feature dominates the components.
- Check `explained_variance_ratio_` to see how much of the variance your components retain.
- Choosing the right number of components balances data simplification against information loss.
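One rough way to look at the trade-off between simplification and information loss is to project the data, map it back with `inverse_transform`, and measure how far the result is from the original. The sketch below does this for each possible number of components on the iris data. Mean squared error is used here as an illustrative measure, and the names `errors` and `X_restored` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

errors = []
for k in range(1, X.shape[1] + 1):
    pca = PCA(n_components=k)
    X_reduced = pca.fit_transform(X)
    # Map the reduced data back to the original 4-feature space
    X_restored = pca.inverse_transform(X_reduced)
    errors.append(np.mean((X - X_restored) ** 2))
    print(f"{k} component(s): mean squared reconstruction error = {errors[-1]:.4f}")
```

The error shrinks as components are added and is effectively zero once all four are kept.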