How to Use t-SNE in Python with sklearn
Use TSNE from sklearn.manifold to reduce high-dimensional data to 2 or 3 dimensions for visualization. Call fit_transform() on your data to get the low-dimensional embedding, and tune parameters such as n_components and perplexity to improve results.
Syntax
The basic syntax to use t-SNE in Python with sklearn is:
- TSNE(n_components=2, perplexity=30.0, learning_rate=200.0, n_iter=1000, random_state=None): Creates the t-SNE model.
- fit_transform(X): Fits the model to data X and returns the transformed low-dimensional data.
Key parameters:
- n_components: Number of dimensions for the output (usually 2 or 3).
- perplexity: Balances attention between local and global aspects of the data (typical values 5-50).
- learning_rate: Controls the speed of optimization.
- n_iter: Number of optimization iterations.
- random_state: Seed for reproducibility.
```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
            n_iter=1000, random_state=42)
X_embedded = tsne.fit_transform(X)
```
Example
This example shows how to apply t-SNE to the famous Iris dataset to reduce its 4 features to 2 dimensions for visualization.
```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load Iris data
iris = load_iris()
X = iris.data
labels = iris.target

# Create and fit t-SNE
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
            n_iter=1000, random_state=42)
X_embedded = tsne.fit_transform(X)

# Plot the result
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=labels, cmap='viridis')
plt.legend(handles=scatter.legend_elements()[0], labels=list(iris.target_names))
plt.title('t-SNE visualization of Iris dataset')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()
```
Output
A scatter plot showing the Iris samples embedded in 2D, with each species colored differently and forming visible clusters.
Common Pitfalls
Common mistakes when using t-SNE include:
- Using very high perplexity values (above 50), which can cause poor clustering.
- Not setting random_state, so results are not reproducible.
- Applying t-SNE directly to very large datasets without sampling, which is slow.
- Misinterpreting t-SNE distances as exact metrics; it preserves local structure but not global distances.
Always preprocess data (e.g., scale features) before applying t-SNE for better results.
```python
from sklearn.manifold import TSNE

# Wrong: very high perplexity
# tsne = TSNE(perplexity=100)  # can cause poor results

# Right: use a moderate perplexity and set random_state
tsne = TSNE(perplexity=30, random_state=42)
```
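As a minimal sketch of the preprocessing advice above, features can be standardized with StandardScaler before fitting t-SNE (shown here on the Iris data; the scaler choice is an assumption, any sensible scaling works):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Standardize each feature to zero mean and unit variance
# so no single feature dominates the pairwise distances
X_scaled = StandardScaler().fit_transform(X)

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X_scaled)
print(X_embedded.shape)  # (150, 2)
```

Scaling matters because t-SNE works on distances between points; features measured on larger scales would otherwise dominate the embedding.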
Quick Reference
| Parameter | Description | Typical Values |
|---|---|---|
| n_components | Output dimension count | 2 or 3 |
| perplexity | Balance between local/global structure | 5 to 50 |
| learning_rate | Optimization speed | 10 to 1000 (default 200) |
| n_iter | Number of optimization iterations | 250 to 1000+ |
| random_state | Seed for reproducibility | Any integer |
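Since perplexity is the parameter that most changes the picture, one common approach is to fit several embeddings across the typical 5-50 range and compare the plots. A sketch (the specific values tried are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X = load_iris().data

# Fit one embedding per perplexity value; the "right" value depends on the data
embeddings = {}
for perp in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42)
    embeddings[perp] = tsne.fit_transform(X)

for perp, emb in embeddings.items():
    print(perp, emb.shape)  # each embedding is (150, 2)
```

Each embedding can then be plotted side by side; low perplexity emphasizes very local neighborhoods, higher values pull in more global structure.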
Key Takeaways
Use sklearn.manifold.TSNE with fit_transform to reduce data dimensions.
Set n_components to 2 or 3 for visualization purposes.
Choose perplexity between 5 and 50 for best clustering results.
Set random_state for reproducible embeddings.
Preprocess data and avoid very large datasets without sampling.
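The sampling advice above can be sketched as follows, using synthetic data as a stand-in for a large dataset (the sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X_large = rng.normal(size=(5000, 10))  # stand-in for a large dataset

# Draw a random subset of rows before running t-SNE,
# since its cost grows quickly with the number of samples
idx = rng.choice(X_large.shape[0], size=500, replace=False)
X_sample = X_large[idx]

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X_sample)
print(X_embedded.shape)  # (500, 2)
```

If the sampled embedding looks informative, larger subsets (or the full data) can be tried afterwards.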