How to Use t-SNE in Python with sklearn
Use TSNE from sklearn.manifold to reduce high-dimensional data to 2 or 3 dimensions for visualization. Call fit_transform() on your data to get the low-dimensional embedding, and tune parameters such as n_components and perplexity to improve results.
Syntax
The basic syntax to use t-SNE in Python with sklearn is:
- TSNE(n_components=2, perplexity=30.0, learning_rate=200.0, n_iter=1000, random_state=None): Creates the t-SNE model.
- fit_transform(X): Fits the model to data X and returns the transformed low-dimensional data.
Key parameters:
- n_components: Number of dimensions for the output (usually 2 or 3).
- perplexity: Balances attention between local and global aspects of the data (typical values 5-50).
- learning_rate: Controls the speed of optimization.
- n_iter: Number of optimization iterations.
- random_state: Seed for reproducibility.
```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
            n_iter=1000, random_state=42)
X_embedded = tsne.fit_transform(X)
```
Example
This example shows how to apply t-SNE to the famous Iris dataset to reduce its 4 features to 2 dimensions for visualization.
```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load Iris data
iris = load_iris()
X = iris.data
labels = iris.target

# Create and fit t-SNE
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
            n_iter=1000, random_state=42)
X_embedded = tsne.fit_transform(X)

# Plot the result
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=labels, cmap='viridis')
plt.legend(handles=scatter.legend_elements()[0], labels=list(iris.target_names))
plt.title('t-SNE visualization of Iris dataset')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()
```
Output
A scatter plot showing the Iris samples embedded in 2D, with each species colored differently and forming visible clusters.
Common Pitfalls
Common mistakes when using t-SNE include:
- Using very high perplexity values (above 50), which can cause poor clustering.
- Not setting random_state, so results are not reproducible.
- Applying t-SNE directly to very large datasets without sampling, which is slow.
- Misinterpreting t-SNE distances as exact metrics; it preserves local structure but not global distances.
Always preprocess data (e.g., scale features) before applying t-SNE for better results.
```python
from sklearn.manifold import TSNE

# Wrong: very high perplexity
# tsne = TSNE(perplexity=100)  # can cause poor results

# Right: use a moderate perplexity and set random_state
tsne = TSNE(perplexity=30, random_state=42)
```
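As a minimal sketch of the preprocessing advice above, features can be standardized with StandardScaler before fitting t-SNE (shown here on the Iris data; the scaler choice is an assumption, any sensible scaling works):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Standardize each feature to zero mean and unit variance
# so no single feature dominates the pairwise distances
X_scaled = StandardScaler().fit_transform(X)

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X_scaled)
print(X_embedded.shape)  # (150, 2)
```

Scaling matters because t-SNE works on distances between points; features measured on larger scales would otherwise dominate the embedding.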
Quick Reference
| Parameter | Description | Typical Values |
|---|---|---|
| n_components | Output dimension count | 2 or 3 |
| perplexity | Balance between local/global structure | 5 to 50 |
| learning_rate | Optimization speed | 10 to 1000 (default 200) |
| n_iter | Number of optimization iterations | 250 to 1000+ |
| random_state | Seed for reproducibility | Any integer |
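Since perplexity is the parameter that most changes the picture, one common approach is to fit several embeddings across the typical 5-50 range and compare the plots. A sketch (the specific values tried are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X = load_iris().data

# Fit one embedding per perplexity value; the "right" value depends on the data
embeddings = {}
for perp in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42)
    embeddings[perp] = tsne.fit_transform(X)

for perp, emb in embeddings.items():
    print(perp, emb.shape)  # each embedding is (150, 2)
```

Each embedding can then be plotted side by side; low perplexity emphasizes very local neighborhoods, higher values pull in more global structure.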
Key Takeaways
Use sklearn.manifold.TSNE with fit_transform to reduce data dimensions.
Set n_components to 2 or 3 for visualization purposes.
Choose perplexity between 5 and 50 for best clustering results.
Set random_state for reproducible embeddings.
Preprocess data and avoid very large datasets without sampling.
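The sampling advice above can be sketched as follows, using synthetic data as a stand-in for a large dataset (the sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X_large = rng.normal(size=(5000, 10))  # stand-in for a large dataset

# Draw a random subset of rows before running t-SNE,
# since its cost grows quickly with the number of samples
idx = rng.choice(X_large.shape[0], size=500, replace=False)
X_sample = X_large[idx]

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X_sample)
print(X_embedded.shape)  # (500, 2)
```

If the sampled embedding looks informative, larger subsets (or the full data) can be tried afterwards.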