PCA vs t-SNE in Python: Key Differences and Usage
PCA is a linear method that reduces dimensions by preserving global variance, while t-SNE is a nonlinear technique focused on preserving local data structure for visualization. Both are available in sklearn but serve different purposes in data analysis.
Quick Comparison
Here is a quick side-by-side comparison of PCA and t-SNE highlighting their main characteristics.
| Factor | PCA | t-SNE |
|---|---|---|
| Type | Linear dimensionality reduction | Nonlinear dimensionality reduction |
| Goal | Preserve global variance | Preserve local neighbor distances |
| Output | Continuous components | Clustered visualization |
| Speed | Fast on large datasets | Slower, computationally intensive |
| Interpretability | Components are linear combinations | No explicit components, only embeddings |
| Use case | Feature reduction, preprocessing | Data visualization, cluster exploration |
Key Differences
PCA works by finding directions (called principal components) that capture the most variance in the data. It is a linear method, meaning it assumes the data is well approximated by a flat, linear subspace and reduces dimensions by projecting the data onto these directions. This makes PCA fast and useful for preprocessing before other algorithms.
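You can check how much variance each component captures via the fitted model's `explained_variance_ratio_` attribute. A minimal sketch on the Iris dataset (the exact numbers depend on the data, but on Iris the first component dominates):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Fit PCA and inspect the fraction of total variance each component keeps
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)
# On Iris, PC1 alone captures roughly 92% of the variance
```

This is often how `n_components` is chosen in practice: keep enough components to cover a target fraction (say 95%) of the total variance.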
t-SNE, on the other hand, is a nonlinear technique designed mainly for visualization. It converts high-dimensional distances into probabilities and tries to keep similar points close in a low-dimensional space. This preserves local structure but can distort global relationships. t-SNE is slower and mainly used to explore clusters visually.
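The trade-off between local and global structure is controlled mainly by the `perplexity` parameter, which roughly sets how many neighbors each point considers; it must be smaller than the number of samples. A minimal sketch (perplexity of 30 is sklearn's default and a common starting point):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X = load_iris().data

# perplexity balances local vs global structure; random_state makes the
# stochastic embedding reproducible
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_emb = tsne.fit_transform(X)
print(X_emb.shape)  # (150, 2)
```

Note that sklearn's TSNE only provides `fit_transform`: there is no `transform` for embedding new, unseen points, which is another reason it is used for exploration rather than as a preprocessing step.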
In sklearn, PCA is implemented as `sklearn.decomposition.PCA` and t-SNE as `sklearn.manifold.TSNE`. Choosing between them depends on whether you want fast feature reduction (PCA) or detailed visualization of clusters (t-SNE).
Code Comparison
Below is an example of using PCA in Python with sklearn to reduce the famous Iris dataset to 2 dimensions.
```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X = iris.data

def plot_pca():
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X)
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target, cmap='viridis')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.title('PCA of Iris Dataset')
    plt.colorbar()
    plt.show()

plot_pca()
```
t-SNE Equivalent
Here is how to use t-SNE on the same Iris dataset to visualize clusters in 2D.
```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X = iris.data

def plot_tsne():
    tsne = TSNE(n_components=2, random_state=42)
    X_tsne = tsne.fit_transform(X)
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target, cmap='viridis')
    plt.xlabel('t-SNE dim 1')
    plt.ylabel('t-SNE dim 2')
    plt.title('t-SNE of Iris Dataset')
    plt.colorbar()
    plt.show()

plot_tsne()
```
When to Use Which
Choose PCA when you need a fast, simple way to reduce dimensions for preprocessing or to understand global variance in your data. It works well for linear relationships and large datasets.
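As a preprocessing step, PCA slots naturally into an sklearn pipeline ahead of a downstream model. A minimal sketch, assuming a simple logistic regression classifier on Iris (the classifier choice and component count are illustrative, not prescriptive):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale features, project onto 2 principal components, then classify
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Because PCA has a `transform` method, the same projection learned on training data is applied consistently to test data inside each cross-validation fold, something t-SNE cannot do.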
Choose t-SNE when your goal is to visualize complex, nonlinear structures and clusters in your data. It is best for small to medium datasets where local relationships matter more than global structure.
In summary, use PCA for feature reduction and t-SNE for detailed visualization.