When we use t-SNE to visualize embeddings, we want to see if similar items group together clearly. The main "metric" is how well the visualization shows clusters or patterns that match what we expect. This is not a number like accuracy but a visual check of neighborhood preservation. We look for tight groups of similar points and clear separation between different groups.
Visualizing embeddings (t-SNE) in NLP - Model Metrics & Evaluation
t-SNE does not produce a confusion matrix because it is for visualization, not classification. Instead, we look at a 2D or 3D scatter plot of points representing embeddings. Points close together mean similar data. For example:
Class A: ● ● ● Class B: ○ ○ ○
● ○
Class A points cluster tightly, separate from Class B points.
This visual grouping helps us understand if embeddings capture meaningful differences.
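As a minimal sketch of the kind of scatter plot described above, the snippet below projects two synthetic classes with t-SNE and draws them with different markers. The data, the class names, and the perplexity value are illustrative assumptions, not part of the original lesson; real embeddings would replace the random arrays.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for real embeddings (an assumption for illustration):
# two classes drawn around different centers in 50 dimensions.
rng = np.random.default_rng(0)
class_a = rng.normal(loc=0.0, scale=1.0, size=(60, 50))
class_b = rng.normal(loc=4.0, scale=1.0, size=(60, 50))
embeddings = np.vstack([class_a, class_b])
labels = np.array(["Class A"] * 60 + ["Class B"] * 60)

# Project to 2D; perplexity=20 is an arbitrary choice for 120 points.
points_2d = TSNE(n_components=2, perplexity=20, random_state=42).fit_transform(embeddings)

# One scatter call per class so each class gets its own marker and legend entry.
fig, ax = plt.subplots(figsize=(6, 5))
for name, marker in [("Class A", "o"), ("Class B", "x")]:
    mask = labels == name
    ax.scatter(points_2d[mask, 0], points_2d[mask, 1], marker=marker, label=name)
ax.set_title("t-SNE projection of synthetic embeddings")
ax.legend()
fig.tight_layout()
```

Calling `plt.show()` or `fig.savefig("tsne_scatter.png")` afterwards displays or stores the figure. The axis values themselves carry no meaning; only which points end up near each other does.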
t-SNE visualization is not about precision or recall. But there is a tradeoff in how t-SNE balances preserving local vs global structure:
- Local structure: t-SNE tries to keep similar points close. This helps see small clusters clearly.
- Global structure: t-SNE may distort distances between big groups to keep local neighborhoods intact.
For example, if you want to see small groups of similar words, focus on local structure. If you want to see overall group relations, t-SNE might not show that well.
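The local versus global tradeoff can be checked roughly in code. The sketch below uses scikit-learn's `trustworthiness` score for local structure and compares relative distances between group centroids before and after projection as a crude view of global structure. The synthetic groups and the centroid comparison are assumptions chosen for illustration, not a standard evaluation protocol.

```python
import numpy as np
from sklearn.manifold import TSNE, trustworthiness

# Synthetic stand-in for real embeddings: 3 groups of 50 points in 50 dimensions.
rng = np.random.default_rng(0)
group_size = 50
centers = rng.normal(scale=5.0, size=(3, 50))
embeddings = np.vstack([center + rng.normal(size=(group_size, 50)) for center in centers])

points_2d = TSNE(n_components=2, perplexity=15, random_state=42).fit_transform(embeddings)

# Local structure: trustworthiness lies in [0, 1]; higher means the plotted
# neighbors of each point were also close in the original space.
local_score = trustworthiness(embeddings, points_2d, n_neighbors=10)


def centroid_distance_ratios(points):
    """Pairwise distances between group centroids, divided by the largest one."""
    centroids = points.reshape(3, group_size, -1).mean(axis=1)
    pairs = [(0, 1), (0, 2), (1, 2)]
    distances = np.array([np.linalg.norm(centroids[i] - centroids[j]) for i, j in pairs])
    return distances / distances.max()


print(f"trustworthiness (local): {local_score:.3f}")
print("relative centroid distances, original:", centroid_distance_ratios(embeddings).round(2))
print("relative centroid distances, t-SNE:   ", centroid_distance_ratios(points_2d).round(2))
```

If the two rows of relative centroid distances differ noticeably while trustworthiness stays high, the plot is preserving neighborhoods but not the spacing between groups.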
Since t-SNE is visual, "good" means:
- Clear clusters of points that match known categories or labels.
- Minimal overlap between different groups.
- Consistent grouping across repeated runs: with the same parameters and random seed the plot is reproducible, and with other seeds the same points still group together.
"Bad" means:
- Points from different groups mixed randomly.
- No visible clusters or patterns.
- Very different results each time you run t-SNE.
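When labels are available, the "good" and "bad" criteria above can be given a rough number to accompany the visual check. The sketch below computes a silhouette score on the 2D points with the true labels and, for contrast, with shuffled labels. Using silhouette on a t-SNE projection is an assumption of this example and only a sanity check, since t-SNE distorts distances.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

# Synthetic labeled data standing in for embeddings with known categories.
embeddings, labels = make_blobs(
    n_samples=150, n_features=50, centers=3, cluster_std=1.0, random_state=0
)

points_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)

# Silhouette ranges from -1 to 1: values near 1 mean tight, well-separated groups.
score_true = silhouette_score(points_2d, labels)

# Shuffling the labels simulates the "bad" case where groups are mixed randomly.
shuffled_labels = np.random.default_rng(0).permutation(labels)
score_shuffled = silhouette_score(points_2d, shuffled_labels)

print(f"silhouette with true labels:     {score_true:.3f}")
print(f"silhouette with shuffled labels: {score_shuffled:.3f}")
```

A score close to the shuffled baseline suggests that the plotted groups do not line up with the known categories.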
- Over-interpretation: t-SNE plots look nice but do not prove model quality. They are just a tool to explore data.
- Randomness: t-SNE uses randomness. Different runs can look different unless you fix the random seed.
- Parameter sensitivity: Perplexity and learning rate affect results a lot. Wrong settings can hide true structure.
- Global structure distortion: t-SNE focuses on local neighborhoods, so distances between clusters may not be meaningful.
- Data leakage: Visualizing embeddings from training data only can hide problems. Always check embeddings on new data too.
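To guard against the parameter-sensitivity pitfall, it can help to run t-SNE over a few perplexity values and compare the results side by side. The sketch below does this on synthetic data and records a trustworthiness score for each setting. The perplexity values and the choice of trustworthiness as the comparison measure are assumptions made for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE, trustworthiness

# Synthetic stand-in for real embeddings: 150 points, 50 dimensions, 3 groups.
embeddings, _ = make_blobs(n_samples=150, n_features=50, centers=3, random_state=0)

results = {}
projections = {}
for perplexity in (5, 30, 50):  # every value must stay below the number of samples
    model = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    points_2d = model.fit_transform(embeddings)
    projections[perplexity] = points_2d  # keep the points so each setting can be plotted
    results[perplexity] = trustworthiness(embeddings, points_2d, n_neighbors=10)

for perplexity, score in results.items():
    print(f"perplexity={perplexity:>2}: trustworthiness={score:.3f}")
```

Structure that appears at only one perplexity value deserves more suspicion than structure that shows up across the whole sweep.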
Your t-SNE plot shows three clear clusters matching your known categories, but when you run it again with a different random seed, the clusters look different. Is your visualization reliable? What should you do?
Answer: Not yet. t-SNE is stochastic, so the position, rotation, and shape of clusters change between seeds. Fix the random seed so the plot is reproducible, then rerun with a few different seeds and perplexity values and check whether the same points still group together. If the three groups persist, the embeddings likely capture meaningful groups; if the grouping itself changes, treat the plot with caution.
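One way to check stability across seeds is sketched below: run t-SNE with several seeds, cluster each 2D result, and compare the cluster assignments with the adjusted Rand index. The use of KMeans, the number of clusters, and `init="random"` (so that the seed actually changes the starting layout) are assumptions of this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for real embeddings with three well-separated groups.
embeddings, _ = make_blobs(n_samples=150, n_features=50, centers=3, random_state=0)


def cluster_labels_for_seed(seed):
    """Project with t-SNE using the given seed, then cluster the 2D points."""
    points_2d = TSNE(
        n_components=2, perplexity=30, init="random", random_state=seed
    ).fit_transform(embeddings)
    return KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points_2d)


runs = [cluster_labels_for_seed(seed) for seed in (0, 1, 2)]

# The adjusted Rand index ignores how clusters are numbered; 1.0 means the
# same points were grouped together in both runs.
agreement = [adjusted_rand_score(runs[0], other) for other in runs[1:]]
print("agreement with the first run:", [round(score, 3) for score in agreement])
```

High agreement means the plots may look different while the grouping is the same, which is the case where the visualization can be trusted more.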
Practice
t-SNE in visualizing word embeddings?
Solution
Step 1: Understand t-SNE's role in dimensionality reduction
t-SNE reduces complex, high-dimensional data like word embeddings into 2D or 3D space for visualization.
Step 2: Differentiate from other tasks
It does not train embeddings or cluster by frequency but helps visualize similarity by reducing dimensions.
Final Answer:
To reduce high-dimensional word vectors into 2D or 3D for easy visualization -> Option C
Quick Check:
t-SNE = dimensionality reduction for visualization [OK]
- Confusing t-SNE with training embeddings
- Thinking t-SNE increases data size
- Assuming t-SNE clusters by word frequency
Solution
Step 1: Recall correct module for t-SNE in scikit-learn
t-SNE is in the sklearn.manifold module and is imported as TSNE.
Step 2: Check syntax correctness
from sklearn.manifold import TSNE is valid import syntax and names the correct module and class. The other options are invalid imports.
Final Answer:
from sklearn.manifold import TSNE -> Option A
Quick Check:
Correct import = from sklearn.manifold import TSNE [OK]
- Using wrong module like sklearn.embedding
- Incorrect import syntax
- Confusing lowercase and uppercase in import
embeddings_2d?
    from sklearn.manifold import TSNE
    import numpy as np

    embeddings = np.random.rand(100, 50)  # 100 words, 50 dimensions
    model = TSNE(n_components=2, random_state=42)
    embeddings_2d = model.fit_transform(embeddings)
Solution
Step 1: Understand input shape and t-SNE output
Input embeddings have shape (100, 50), meaning 100 samples with 50 features each.
Step 2: Check t-SNE output shape with n_components=2
t-SNE reduces features to 2 dimensions, so output shape is (100, 2): 100 samples, 2 features.
Final Answer:
(100, 2) -> Option A
Quick Check:
Output shape = (samples, n_components) = (100, 2) [OK]
- Confusing rows and columns in output shape
- Assuming output shape equals input shape
- Mixing up n_components with sample count
Solution
Step 1: Understand perplexity parameter in t-SNE
Perplexity controls neighborhood size and must be less than the number of samples.
Step 2: Identify cause of ValueError
The error means perplexity is set equal to or larger than the sample count, which is invalid.
Step 3: Fix by lowering perplexity
Reduce perplexity to a value smaller than the number of samples to fix the error.
Final Answer:
Perplexity is set too high; reduce it below the number of samples -> Option B
Quick Check:
Perplexity < n_samples to avoid error [OK]
- Changing input shape instead of perplexity
- Ignoring the perplexity limit
- Assuming normalization fixes this error
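The perplexity fix can be sketched as follows. The snippet tries the default-sized perplexity on a tiny dataset, reports the error if the installed scikit-learn raises one (recent releases do; older ones may not), and then reruns with a lower value. The "about one third of the sample count" cap is a heuristic assumed for this example, not a rule from the library.

```python
import numpy as np
from sklearn.manifold import TSNE

# Only 20 samples: too few for a perplexity of 30.
embeddings = np.random.default_rng(0).random((20, 50))
n_samples = embeddings.shape[0]
requested_perplexity = 30

try:
    TSNE(n_components=2, perplexity=requested_perplexity, random_state=42).fit_transform(embeddings)
except ValueError as err:
    # Recent scikit-learn versions reject perplexity >= n_samples.
    print(f"ValueError: {err}")

# Assumed heuristic: keep perplexity at or below roughly a third of the sample count.
safe_perplexity = min(requested_perplexity, (n_samples - 1) / 3)

embeddings_2d = TSNE(
    n_components=2, perplexity=safe_perplexity, random_state=42
).fit_transform(embeddings)
print(f"perplexity={safe_perplexity:.2f}, output shape={embeddings_2d.shape}")
```

Changing the input shape or normalizing the vectors would not help here, because the check depends only on the sample count and the perplexity value.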
Solution
Step 1: Understand t-SNE limitations with large datasets
t-SNE works best with small to medium data; large sets cause crowded plots and slow computation.
Step 2: Choose practical solution for clarity
Reducing the dataset size by selecting fewer words improves plot clarity and speed.
Step 3: Evaluate other options
Increasing perplexity too high or keeping many dimensions defeats t-SNE's purpose; raw embeddings are hard to visualize.
Final Answer:
Reduce the number of words by selecting a smaller subset before applying t-SNE -> Option D
Quick Check:
Smaller data = clearer t-SNE plots [OK]
- Setting perplexity too high
- Using too many dimensions in t-SNE
- Trying to visualize raw embeddings directly
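Selecting a subset before running t-SNE might look like the sketch below. The vocabulary, the `word_i` names, and the random choice of 300 words are hypothetical; in practice the subset is often the most frequent words or a hand-picked list of interest.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical large vocabulary: 5000 words with 100-dimensional vectors.
vocab_size, dim, subset_size = 5000, 100, 300
embeddings = rng.normal(size=(vocab_size, dim))
words = [f"word_{i}" for i in range(vocab_size)]

# Pick a subset of rows without repetition and keep the matching words for labeling.
subset_idx = rng.choice(vocab_size, size=subset_size, replace=False)
subset_embeddings = embeddings[subset_idx]
subset_words = [words[i] for i in subset_idx]

# t-SNE now runs on 300 points instead of 5000, which is faster and less crowded.
subset_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(subset_embeddings)
print(subset_2d.shape, subset_words[:3])
```

Keeping `subset_words` aligned with `subset_2d` makes it possible to annotate individual points in the plot later.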
