MLOps · Concept · Beginner · 3 min read

What is PCA in Machine Learning in Python | Explained Simply

In machine learning, PCA (Principal Component Analysis) is a technique for reducing the number of features in a dataset by finding new combined features that retain most of the information. In Python, sklearn.decomposition.PCA makes it easy to transform data into fewer dimensions while preserving the important patterns.
⚙️

How It Works

PCA works like finding the best new directions to look at your data so you can see the most important patterns with fewer details. Imagine you have a big messy photo with many colors and details, but you want to keep only the main shapes and colors that tell the story. PCA finds these main shapes by combining original features into new ones called principal components.

It does this by measuring how the features vary together and then creating new axes that capture the most variation. The first principal component captures the most variation; the second captures the next most and is uncorrelated with the first; and so on. By keeping just a few components instead of all of the original features, you get a simpler dataset that is easier to work with.
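The steps above can be sketched by hand with NumPy: center the data, measure how the features vary together with a covariance matrix, and use its eigenvectors as the new axes. This is just an illustration of the idea (the small dataset is made up; scikit-learn's implementation uses a different, more robust computation internally):

```python
import numpy as np

# Toy data: 6 points with 2 strongly correlated features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Center the data (subtract each feature's mean)
X_centered = X - X.mean(axis=0)

# 2. Measure how the features vary together
cov = np.cov(X_centered, rowvar=False)

# 3. Find the new axes (eigenvectors) and how much variation
#    each one captures (eigenvalues)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns ascending order; reverse so the first
# component captures the most variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Project the data onto just the top component: 2 features -> 1
X_projected = X_centered @ eigenvectors[:, :1]

print("Share of variance per component:", eigenvalues / eigenvalues.sum())
```

Because the two features move together, the first component ends up holding nearly all of the variation, which is exactly why the second one can be dropped with little loss.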

💻

Example

This example shows how to use PCA in Python with sklearn to reduce a dataset with 4 features down to 2 principal components.

python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load sample data
iris = load_iris()
X = iris.data

# Create PCA object to reduce to 2 components
pca = PCA(n_components=2)

# Fit PCA on data and transform it
X_pca = pca.fit_transform(X)

# Show shape before and after PCA
print(f"Original shape: {X.shape}")
print(f"Transformed shape: {X_pca.shape}")

# Show explained variance ratio
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
Output
Original shape: (150, 4)
Transformed shape: (150, 2)
Explained variance ratio: [0.92461872 0.05306648]
🎯

When to Use

Use PCA when you have data with many features and want to simplify it without losing much important information. It helps speed up machine learning models and makes visualization easier by reducing dimensions.

For example, PCA is useful in image processing to reduce pixel data, in finance to find main factors affecting stock prices, or in biology to analyze gene expression data. It is best when features are correlated and you want to remove redundancy.
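As one way to sketch the "speed up models" use case, PCA can sit inside a scikit-learn pipeline in front of a classifier. The scaler, component count, and classifier here are illustrative choices, not the only reasonable ones:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale first so no single large-valued feature dominates the variance,
# then reduce 4 features to 2 components before fitting the classifier
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(model, X, y, cv=5)
print(f"Mean accuracy with 2 components: {scores.mean():.3f}")
```

Scaling before PCA matters in practice: PCA maximizes variance, so unscaled features measured in large units would otherwise dominate the components.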

Key Points

  • PCA reduces data dimensions by creating new combined features called principal components.
  • It keeps the most important information by capturing maximum variance.
  • Implemented in Python using sklearn.decomposition.PCA.
  • Helps improve speed and visualization in machine learning tasks.
  • Works best when original features are correlated.

Key Takeaways

PCA reduces feature count by combining them into principal components that keep most data variance.
Use sklearn's PCA to easily apply dimensionality reduction in Python.
PCA improves model speed and helps visualize high-dimensional data.
It is most effective when features have correlations and redundancy.
Always check explained variance to decide how many components to keep.
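For that last point, scikit-learn's PCA accepts a float between 0 and 1 as n_components, in which case it keeps the smallest number of components whose cumulative explained variance reaches that threshold. A short sketch on the iris data (the 95% threshold is just an example choice):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Keep however many components are needed to explain >= 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(f"Components kept: {pca.n_components_}")
print(f"Cumulative variance explained: {pca.explained_variance_ratio_.sum():.3f}")
```

Inspecting pca.explained_variance_ratio_ like this, rather than hard-coding a component count, lets the data itself decide how many dimensions are worth keeping.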