What is PCA in Machine Learning in Python | Explained Simply
PCA (Principal Component Analysis) is a technique that reduces the number of features in a dataset by building new features, each a linear combination of the original ones, that retain most of the variance in the data. In Python, sklearn.decomposition.PCA applies PCA to transform data into fewer dimensions while preserving its main patterns.
How It Works
PCA works like finding the best new directions to look at your data so you can see the most important patterns with fewer details. Imagine you have a big messy photo with many colors and details, but you want to keep only the main shapes and colors that tell the story. PCA finds these main shapes by combining original features into new ones called principal components.
It does this by measuring how the features vary together (their covariance) and then creating new axes that capture the most variation. The first principal component captures the most variance; the second captures the most remaining variance while being orthogonal to the first, and so on. You can then keep just a few components instead of all the original features, which makes the data simpler and easier to work with.
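The steps described above can be sketched in plain NumPy. This is an illustration of the idea, not how scikit-learn implements it (scikit-learn uses SVD-based solvers, and the signs of its components may differ). The function name `pca_sketch` and the toy data are hypothetical, chosen only for this example.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    # Center each feature so variation is measured around the mean
    X_centered = X - X.mean(axis=0)
    # Covariance matrix: how each pair of features varies together
    cov = np.cov(X_centered, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Reorder so the direction with the most variance comes first
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Keep the leading directions (columns) as the principal components
    components = eigenvectors[:, :n_components]
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return X_centered @ components, components, explained_ratio

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 3 features; the second is nearly a scaled copy of the first
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

X_reduced, components, explained_ratio = pca_sketch(X, n_components=2)
print(f"Reduced shape: {X_reduced.shape}")
print(f"Explained variance ratio: {explained_ratio}")
```

Because the eigenvectors of a symmetric covariance matrix are orthogonal, each new axis is "different from" the previous ones in the precise sense that it carries no overlapping variance.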
Example
This example shows how to use PCA in Python with sklearn to reduce a dataset with 4 features down to 2 principal components.
    from sklearn.decomposition import PCA
    from sklearn.datasets import load_iris

    # Load sample data (150 samples, 4 features)
    iris = load_iris()
    X = iris.data

    # Create PCA object to reduce to 2 components
    pca = PCA(n_components=2)

    # Fit PCA on data and transform it
    X_pca = pca.fit_transform(X)

    # Show shape before and after PCA
    print(f"Original shape: {X.shape}")
    print(f"Transformed shape: {X_pca.shape}")

    # Show the share of total variance captured by each component
    print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
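The example above fixes the number of components at 2. If you would rather choose the number from the data, one possible approach is to look at the cumulative explained variance, or to pass a float between 0 and 1 as n_components so that scikit-learn keeps enough components to reach that share of variance. The 0.95 threshold below is an illustrative assumption, not a value from this article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Fit with all components to see how explained variance accumulates
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(f"Cumulative explained variance: {cumulative}")

# A float n_components in (0, 1) keeps enough components to reach that variance share.
# 0.95 is an assumed, illustrative threshold.
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_95 = pca_95.fit_transform(X)
print(f"Components kept: {pca_95.n_components_}")
print(f"Transformed shape: {X_95.shape}")
```

After fitting, the `n_components_` attribute reports how many components were actually kept.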
When to Use
Use PCA when your data has many features and you want to simplify it without losing much important information. Fewer dimensions can speed up model training and make the data easier to visualize.
For example, PCA is useful in image processing to reduce pixel data, in finance to find main factors affecting stock prices, or in biology to analyze gene expression data. It is best when features are correlated and you want to remove redundancy.
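To see why correlated features matter, here is a small sketch that compares how much variance the first component captures on two synthetic datasets. Both datasets are hypothetical and generated only for illustration: one has four features driven by a single hidden factor, the other has four independent features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n_samples = 500

# Hypothetical correlated data: four noisy copies of one hidden factor
factor = rng.normal(size=(n_samples, 1))
X_correlated = factor + 0.1 * rng.normal(size=(n_samples, 4))

# Hypothetical uncorrelated data: four independent features
X_independent = rng.normal(size=(n_samples, 4))

# Share of total variance captured by the first principal component
ratio_correlated = PCA(n_components=1).fit(X_correlated).explained_variance_ratio_[0]
ratio_independent = PCA(n_components=1).fit(X_independent).explained_variance_ratio_[0]

print(f"First component, correlated features: {ratio_correlated:.2f}")
print(f"First component, independent features: {ratio_independent:.2f}")
```

With correlated features, a single component should capture nearly all of the variance, so the other dimensions can be dropped at little cost. With independent features, each component carries a similar share, and reducing dimensions discards real information.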
Key Points
- PCA reduces data dimensions by creating new combined features called principal components.
- It keeps the most important information by capturing maximum variance.
- Implemented in Python using sklearn.decomposition.PCA.
- Helps improve speed and visualization in machine learning tasks.
- Works best when original features are correlated.