How to Use PCA in Python with sklearn: Simple Guide
Use sklearn.decomposition.PCA to perform Principal Component Analysis in Python. Fit the PCA model on your data with fit() or fit_transform() to reduce dimensions and extract principal components.
Syntax
The basic syntax to use PCA in Python with sklearn is:
- PCA(n_components): Create a PCA object specifying how many principal components to keep.
- fit(X): Learn the principal components from data X.
- transform(X): Apply the dimensionality reduction to X.
- fit_transform(X): Fit PCA and transform X in one step.
python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)      # keep 2 principal components
pca.fit(X)                     # learn components from data X
X_reduced = pca.transform(X)   # reduce dimensions of X

# Or combine fit and transform
X_reduced = pca.fit_transform(X)
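The split between fit() and transform() matters most when new data arrives after the model has been fitted. The following is a minimal sketch of that case, assuming a synthetic random dataset and an arbitrary 80/20 split; both are illustrative choices and not part of the original guide.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 100 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Hold out 20% of the rows to play the role of unseen data
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

pca = PCA(n_components=2)
pca.fit(X_train)                          # learn components from the training rows only
X_train_reduced = pca.transform(X_train)  # project the training rows
X_test_reduced = pca.transform(X_test)    # reuse the same components on unseen rows

print(X_train_reduced.shape, X_test_reduced.shape)
```

Calling fit_transform() on the held-out rows instead would learn a second, different set of components, so the two projections would no longer be comparable.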
Example
This example shows how to apply PCA to the Iris dataset to reduce its 4 features to 2 principal components and print the transformed data.
python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load Iris dataset
iris = load_iris()
X = iris.data

# Create PCA object to reduce to 2 components
pca = PCA(n_components=2)

# Fit PCA and transform data
X_pca = pca.fit_transform(X)

# Print first 5 rows of transformed data
print(X_pca[:5])
Output
[[-2.68412563 0.31939725]
[-2.71414169 -0.17700123]
[-2.88899057 -0.14494943]
[-2.74534286 -0.31829898]
[-2.72871654 0.32675451]]
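To see what the fitted model learned, you can inspect its attributes after fitting. The sketch below repeats the same fit and prints the shape of the reduced data, the shape of components_ (one row per kept component, one column per original feature), and the share of variance each component captures.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples, 4 features
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                    # (n_samples, n_components)
print(pca.components_.shape)          # (n_components, n_features)
print(pca.explained_variance_ratio_)  # fraction of variance per component
```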
Common Pitfalls
Common mistakes when using PCA include:
- Not scaling data before PCA, which can cause features with larger scales to dominate.
- Choosing too many or too few components without checking explained variance.
- Applying PCA on categorical data without encoding.
Always scale your data (e.g., with StandardScaler) before PCA so that every feature contributes on a comparable scale. PCA centers the data but does not rescale it.
python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X is the Iris feature matrix loaded in the example above

# Wrong way: PCA without scaling
pca = PCA(n_components=2)
X_pca_wrong = pca.fit_transform(X)  # May give misleading results

# Right way: scale then PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=2)
X_pca_right = pca.fit_transform(X_scaled)

print('Explained variance ratio:', pca.explained_variance_ratio_)
Output
Explained variance ratio: [0.72770452 0.23030523]
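The explained variance ratio can also guide how many components to keep. The following is a minimal sketch of one common approach: fit PCA with all components, accumulate the ratios, and keep the smallest number that reaches a chosen threshold. The 0.95 threshold is an illustrative assumption, not a rule from this guide.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit with all components to see the full variance profile
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.95  # illustrative choice
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print(cumulative, n_keep)

# Alternative: pass the threshold as a float and let PCA pick the count
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_scaled)
print(pca_auto.n_components_)
```

Passing a float between 0 and 1 as n_components asks sklearn to select the count for you; svd_solver="full" is set explicitly here because that option is documented for the full solver.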
Quick Reference
| Method / attribute | Description |
|---|---|
| PCA(n_components) | Create PCA model with number of components to keep |
| fit(X) | Compute principal components from data X |
| transform(X) | Apply dimensionality reduction to X |
| fit_transform(X) | Fit PCA and transform X in one step |
| explained_variance_ratio_ | Fraction of total variance (between 0 and 1) explained by each kept component |
Key Takeaways
Always scale your data before applying PCA so that features with larger scales do not dominate.
Use PCA to reduce data dimensions by keeping the most important components.
Check explained variance ratio to decide how many components to keep.
Use fit_transform() to fit PCA and reduce data in one step.
PCA works only on numerical data, so encode categorical features first.
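As a rough illustration of encoding before PCA, the sketch below one-hot encodes a categorical column, scales the numeric columns, and runs PCA on the combined matrix. The toy values and the column split are hypothetical, and one-hot encoding is only one possible encoding choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column
numeric = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0],
                    [4.0, 210.0], [5.0, 190.0], [6.0, 230.0]])
category = np.array([["red"], ["blue"], ["red"],
                     ["green"], ["blue"], ["green"]])

# One-hot encode the categorical column; toarray() converts the sparse result to dense
encoded = OneHotEncoder().fit_transform(category).toarray()

# Scale the numeric columns, then join them with the encoded columns
numeric_scaled = StandardScaler().fit_transform(numeric)
X = np.hstack([numeric_scaled, encoded])

X_pca = PCA(n_components=2).fit_transform(X)
print(X.shape, X_pca.shape)
```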