MLOps · How-To · Beginner · 4 min read

How to Visualize High Dimensional Data in Python with sklearn

To visualize high dimensional data in Python, use sklearn tools like PCA (Principal Component Analysis) or TSNE (t-distributed Stochastic Neighbor Embedding) to reduce dimensions to 2 or 3. Then plot the reduced data using libraries like matplotlib or seaborn for easy interpretation.

Syntax

Use PCA from sklearn.decomposition or TSNE from sklearn.manifold to reduce the data to two or three dimensions, then plot the result with matplotlib.pyplot.

  • PCA(n_components=2): reduces data to 2 dimensions.
  • TSNE(n_components=2, perplexity=30): reduces data non-linearly, good for complex data.
  • fit_transform(X): applies the reduction on data X.
python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# X: array-like of shape (n_samples, n_features)

# PCA example
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# t-SNE example (set random_state so the embedding is reproducible)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)

# Plot the PCA result
plt.scatter(X_reduced[:, 0], X_reduced[:, 1])
plt.show()
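Before trusting a 2-D PCA plot, it can help to check explained_variance_ratio_, which reports what fraction of the original variance each component keeps. A minimal sketch using the Iris data (the same dataset as the example below); the exact numbers depend on your scikit-learn version:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Fit PCA and inspect how much variance the two components retain
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)        # per-component fraction
print(pca.explained_variance_ratio_.sum())  # total retained variance
```

If the total is low, a 2-D plot may hide most of the structure and a non-linear method like t-SNE may be worth trying.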

Example

This example shows how to reduce the famous Iris dataset from 4 dimensions to 2 using PCA and then plot it with colors for each species.

python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Reduce dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot
plt.figure(figsize=(8,6))
for target in set(y):
    plt.scatter(X_pca[y == target, 0], X_pca[y == target, 1], label=iris.target_names[target])
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('PCA of Iris Dataset')
plt.legend()
plt.show()
Output
A scatter plot with three colored clusters labeled setosa, versicolor, and virginica along two PCA components.
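For comparison, the same Iris data can be embedded with t-SNE. A sketch along the lines of the PCA example above; note that the exact cluster shapes vary with random_state and scikit-learn version, and perplexity must be smaller than the number of samples (150 here):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Non-linear reduction to 2 dimensions
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)

# Plot one color per species
plt.figure(figsize=(8, 6))
for target in set(y):
    plt.scatter(X_tsne[y == target, 0], X_tsne[y == target, 1],
                label=iris.target_names[target])
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.title('t-SNE of Iris Dataset')
plt.legend()
plt.show()
```

Unlike PCA, the axes here have no direct meaning; only the relative positions of points matter.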

Common Pitfalls

Common mistakes include:

  • Not scaling data before PCA or t-SNE, which can distort results.
  • Using too high perplexity in t-SNE causing poor visualization.
  • Trying to visualize too many points without sampling, leading to cluttered plots.
  • Confusing PCA (linear) with t-SNE (non-linear) and choosing the wrong method for your data.
python
from sklearn.preprocessing import StandardScaler

# Wrong: PCA without scaling
pca_wrong = PCA(n_components=2)
X_pca_wrong = pca_wrong.fit_transform(X)  # May give misleading results

# Right: Scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca_right = PCA(n_components=2)
X_pca_right = pca_right.fit_transform(X_scaled)
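The clutter pitfall above is usually handled by sampling rows before reducing and plotting. A minimal sketch with NumPy; the dataset here is random and its size is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a dataset too large to plot point-by-point
X_big = rng.normal(size=(100_000, 50))

# Sample 5,000 rows without replacement, then reduce and plot the sample
idx = rng.choice(X_big.shape[0], size=5_000, replace=False)
X_sample = X_big[idx]
print(X_sample.shape)  # (5000, 50)
```

Sampling also matters for speed: t-SNE in particular scales poorly with the number of points.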

Quick Reference

Method           Use Case                        Key Parameter            Notes
PCA              Linear dimension reduction      n_components (e.g., 2)   Good for linearly separable data
t-SNE            Non-linear dimension reduction  perplexity (5-50)        Better for complex clusters, slower
UMAP (optional)  Fast non-linear reduction       n_neighbors, min_dist    Alternative to t-SNE, preserves global structure
Scaling          Preprocessing step              StandardScaler()         Always scale data before PCA or t-SNE

Key Takeaways

  • Use PCA or t-SNE from sklearn to reduce high dimensional data to 2 or 3 dimensions for visualization.
  • Always scale your data before applying PCA or t-SNE to get meaningful results.
  • Choose PCA for linear data structure and t-SNE for complex, non-linear patterns.
  • Plot the reduced data using matplotlib or seaborn for clear visual insights.
  • Avoid clutter by sampling large datasets before visualization.