MLOps · How-To · Beginner · 3 min read

How to Use t-SNE in Python with sklearn

Use TSNE from sklearn.manifold to reduce high-dimensional data to 2 or 3 dimensions for visualization. Fit your data with fit_transform() to get the low-dimensional embedding. Customize parameters like n_components and perplexity to improve results.
📐 Syntax

The basic syntax to use t-SNE in Python with sklearn is:

  • TSNE(n_components=2, perplexity=30.0, learning_rate=200.0, n_iter=1000, random_state=None): Creates the t-SNE model. (Note: in scikit-learn ≥ 1.5 n_iter is renamed max_iter, and since 1.2 the learning_rate default is 'auto'.)
  • fit_transform(X): Fits the model to data X and returns the transformed low-dimensional data.

Key parameters:

  • n_components: Number of dimensions for output (usually 2 or 3).
  • perplexity: Balances attention between local and global aspects of data (typical values 5-50).
  • learning_rate: Controls the speed of optimization.
  • n_iter: Number of iterations for optimization.
  • random_state: Seed for reproducibility.
python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, n_iter=1000, random_state=42)
X_embedded = tsne.fit_transform(X)
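fit_transform returns an array of shape (n_samples, n_components). A minimal sketch with synthetic data (the array sizes here are arbitrary stand-ins) confirms the output shape:

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in data: 100 samples, 10 features (arbitrary sizes)
rng = np.random.RandomState(0)
X = rng.rand(100, 10)

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)  # (100, 2): one 2-D point per input sample
```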
💻 Example

This example shows how to apply t-SNE to the famous Iris dataset to reduce its 4 features to 2 dimensions for visualization.

python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load Iris data
iris = load_iris()
X = iris.data
labels = iris.target

# Create and fit t-SNE
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, n_iter=1000, random_state=42)
X_embedded = tsne.fit_transform(X)

# Plot the result
plt.figure(figsize=(8,6))
scatter = plt.scatter(X_embedded[:,0], X_embedded[:,1], c=labels, cmap='viridis')
plt.legend(handles=scatter.legend_elements()[0], labels=iris.target_names)
plt.title('t-SNE visualization of Iris dataset')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()
Output
A scatter plot window showing 2D clusters of Iris species colored differently.
⚠️ Common Pitfalls

Common mistakes when using t-SNE include:

  • Using very high perplexity values (above 50), which can distort the clustering; perplexity must also be smaller than the number of samples, and recent scikit-learn versions raise an error otherwise.
  • Not setting random_state for reproducible results.
  • Applying t-SNE directly on very large datasets without sampling, which is slow.
  • Misinterpreting t-SNE distances as exact metrics; it preserves local structure but not global distances.

Always preprocess data (e.g., scale features) before applying t-SNE for better results.

python
from sklearn.manifold import TSNE

# Wrong: Very high perplexity
# tsne = TSNE(perplexity=100)  # Can cause poor results

# Right: Use moderate perplexity and set random_state
tsne = TSNE(perplexity=30, random_state=42)
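Putting the preprocessing advice into code: a sketch (with made-up data) that scales features first and, for wide or large datasets, reduces to around 30 dimensions with PCA before running t-SNE:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Made-up wide dataset: 300 samples, 100 features
rng = np.random.RandomState(0)
X = rng.rand(300, 100)

# Scale features so no single feature dominates the pairwise distances
X_scaled = StandardScaler().fit_transform(X)

# With many features, reducing via PCA first speeds up t-SNE considerably
X_reduced = PCA(n_components=30, random_state=42).fit_transform(X_scaled)

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X_reduced)
print(X_embedded.shape)  # (300, 2)
```

For truly large datasets, subsampling rows before this pipeline keeps runtimes manageable.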
📊 Quick Reference

| Parameter | Description | Typical Values |
|---|---|---|
| n_components | Output dimension count | 2 or 3 |
| perplexity | Balance between local/global structure | 5 to 50 |
| learning_rate | Optimization speed | 10 to 1000 (default 200) |
| n_iter | Number of optimization iterations | 250 to 1000+ |
| random_state | Seed for reproducibility | Any integer |

Key Takeaways

  • Use sklearn.manifold.TSNE with fit_transform to reduce data dimensions.
  • Set n_components to 2 or 3 for visualization purposes.
  • Choose perplexity between 5 and 50 for best clustering results.
  • Set random_state for reproducible embeddings.
  • Scale features before applying t-SNE, and subsample or PCA-reduce very large datasets first.
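As a sanity check on the random_state takeaway, two runs with the same seed on the same data should produce the same embedding. A small sketch with synthetic data:

```python
import numpy as np
from sklearn.manifold import TSNE

# Arbitrary small dataset; perplexity must stay below the sample count
rng = np.random.RandomState(0)
X = rng.rand(80, 5)

emb1 = TSNE(n_components=2, perplexity=15, random_state=42).fit_transform(X)
emb2 = TSNE(n_components=2, perplexity=15, random_state=42).fit_transform(X)

print(np.allclose(emb1, emb2))  # identical embeddings across runs
```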