MLOps · How-To · Beginner · 4 min read

How to Visualize High Dimensional Data in Python with sklearn

To visualize high dimensional data in Python, use sklearn tools like PCA (Principal Component Analysis) or TSNE (t-distributed Stochastic Neighbor Embedding) to reduce dimensions to 2 or 3. Then plot the reduced data using libraries like matplotlib or seaborn for easy interpretation.

Syntax

Use PCA from sklearn.decomposition or TSNE from sklearn.manifold to reduce the data to two or three dimensions, then plot the result with matplotlib.pyplot.

  • PCA(n_components=2): reduces data to 2 dimensions.
  • TSNE(n_components=2, perplexity=30): reduces data non-linearly, good for complex data.
  • fit_transform(X): applies the reduction on data X.
python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# X: array-like of shape (n_samples, n_features)

# PCA example
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# t-SNE example (set random_state so the embedding is reproducible)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)

# Plot the PCA result
plt.scatter(X_reduced[:, 0], X_reduced[:, 1])
plt.show()
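Before trusting a 2-D PCA plot, it can help to check explained_variance_ratio_, which reports what fraction of the original variance each component keeps. A minimal sketch using the Iris data (the same dataset as the example below); the exact numbers depend on your scikit-learn version:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Fit PCA and inspect how much variance the two components retain
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)        # per-component fraction
print(pca.explained_variance_ratio_.sum())  # total retained variance
```

If the total is low, a 2-D plot may hide most of the structure and a non-linear method like t-SNE may be worth trying.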

Example

This example shows how to reduce the famous Iris dataset from 4 dimensions to 2 using PCA and then plot it with colors for each species.

python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Reduce dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot
plt.figure(figsize=(8,6))
for target in set(y):
    plt.scatter(X_pca[y == target, 0], X_pca[y == target, 1], label=iris.target_names[target])
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('PCA of Iris Dataset')
plt.legend()
plt.show()
Output
A scatter plot with three colored clusters labeled setosa, versicolor, and virginica along two PCA components.
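For comparison, the same Iris data can be embedded with t-SNE. A sketch along the lines of the PCA example above; note that the exact cluster shapes vary with random_state and scikit-learn version, and perplexity must be smaller than the number of samples (150 here):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Non-linear reduction to 2 dimensions
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)

# Plot one color per species
plt.figure(figsize=(8, 6))
for target in set(y):
    plt.scatter(X_tsne[y == target, 0], X_tsne[y == target, 1],
                label=iris.target_names[target])
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.title('t-SNE of Iris Dataset')
plt.legend()
plt.show()
```

Unlike PCA, the axes here have no direct meaning; only the relative positions of points matter.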

Common Pitfalls

Common mistakes include:

  • Not scaling data before PCA or t-SNE, which can distort results.
  • Using too high perplexity in t-SNE causing poor visualization.
  • Trying to visualize too many points without sampling, leading to cluttered plots.
  • Confusing PCA (linear) with t-SNE (non-linear) and choosing the wrong method for your data.
python
from sklearn.preprocessing import StandardScaler

# Wrong: PCA without scaling
pca_wrong = PCA(n_components=2)
X_pca_wrong = pca_wrong.fit_transform(X)  # May give misleading results

# Right: Scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca_right = PCA(n_components=2)
X_pca_right = pca_right.fit_transform(X_scaled)
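The clutter pitfall above is usually handled by sampling rows before reducing and plotting. A minimal sketch with NumPy; the dataset here is random and its size is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a dataset too large to plot point-by-point
X_big = rng.normal(size=(100_000, 50))

# Sample 5,000 rows without replacement, then reduce and plot the sample
idx = rng.choice(X_big.shape[0], size=5_000, replace=False)
X_sample = X_big[idx]
print(X_sample.shape)  # (5000, 50)
```

Sampling also matters for speed: t-SNE in particular scales poorly with the number of points.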

Quick Reference

Method           Use Case                        Key Parameter            Notes
PCA              Linear dimension reduction      n_components (e.g., 2)   Good for linearly separable data
t-SNE            Non-linear dimension reduction  perplexity (5-50)        Better for complex clusters, slower
UMAP (optional)  Fast non-linear reduction       n_neighbors, min_dist    Alternative to t-SNE, preserves global structure
Scaling          Preprocessing step              StandardScaler()         Always scale data before PCA or t-SNE

Key Takeaways

  • Use PCA or t-SNE from sklearn to reduce high dimensional data to 2 or 3 dimensions for visualization.
  • Always scale your data before applying PCA or t-SNE to get meaningful results.
  • Choose PCA for linear data structure and t-SNE for complex, non-linear patterns.
  • Plot the reduced data using matplotlib or seaborn for clear visual insights.
  • Avoid clutter by sampling large datasets before visualization.