
t-SNE for visualization in ML Python - Model Metrics & Evaluation

Metrics & Evaluation - t-SNE for visualization
Which metric matters for t-SNE visualization and WHY

t-SNE is a tool to show complex data in 2D or 3D so we can see patterns. It does not predict or classify, so usual accuracy metrics do not apply.

Instead, we look at how well t-SNE keeps similar points close and different points apart. This is often checked by visual inspection or by measuring trustworthiness and continuity scores.

These scores tell us if neighbors in the t-SNE map were also neighbors in the original data (trustworthiness), and if neighbors in the original data stay neighbors in the map (continuity).
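As a minimal sketch of how these two scores might be computed: scikit-learn provides `trustworthiness` in `sklearn.manifold`, but has no built-in continuity function. The example assumes the common convention of obtaining continuity by calling `trustworthiness` with the two datasets swapped. The digits subset, `perplexity=30`, and `k = 10` are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness

# Small subset of the bundled digits data (64 features) to keep the run fast.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

k = 10  # neighborhood size (illustrative); must be smaller than n_samples / 2

# Penalizes points that are neighbors in the map but not in the original data.
trust = trustworthiness(X, X_embedded, n_neighbors=k)

# Assumption: continuity computed as trustworthiness with the roles swapped,
# so it penalizes original neighbors that were pushed apart in the map.
cont = trustworthiness(X_embedded, X, n_neighbors=k)

print(f"trustworthiness@{k}: {trust:.3f}")
print(f"continuity@{k}:      {cont:.3f}")
```

Both values lie between 0 and 1, and they depend on the chosen `k`, so compare scores only when they were computed with the same neighborhood size.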

Confusion matrix or equivalent visualization

t-SNE does not produce a confusion matrix because it is not a classifier.

Instead, we use a scatter plot showing points colored by their true group or label.

    Example t-SNE plot:

    +------------------------------------------------+
    |  *  *     *      *      *      *      *      *  |
    | * Group A points clustered together clearly    |
    |                                                |
    |        ++++++   ++++++   ++++++                |
    |        + Group B points clustered separately   |
    |                                                |
    |  o  o  o      o      o      o      o      o    |
    |  o Group C points scattered or overlapping     |
    +------------------------------------------------+
    

This visual helps us see if t-SNE separated groups well.
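A labeled scatter plot like the one sketched above could be produced roughly as follows. This is a sketch under assumptions: it uses matplotlib (not mentioned in the original text), the bundled digits dataset as stand-in data, and a hypothetical output file name `tsne_labels.png`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Stand-in data: 300 handwritten digits with their true labels.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

X_embedded = TSNE(n_components=2, random_state=42).fit_transform(X)

fig, ax = plt.subplots(figsize=(6, 5))
# One dot per sample, colored by its true label.
points = ax.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, cmap="tab10", s=12)
fig.colorbar(points, ax=ax, label="true label")
ax.set_title("t-SNE map colored by true label")
# Axis values in a t-SNE map have no direct meaning, so hide the ticks.
ax.set_xticks([])
ax.set_yticks([])
fig.savefig("tsne_labels.png", dpi=150)  # hypothetical file name
```

The labels are used only for coloring; t-SNE itself never sees them.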

Precision vs Recall tradeoff (or equivalent) with concrete examples

t-SNE does not have precision or recall because it is not a classifier.

Instead, there is a tradeoff between local structure and global structure preservation.

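  • Local structure: How well close points stay close. Good for seeing clusters.
  • Global structure: How well overall distances between clusters are kept. Good for understanding the big picture.

For example, if you want to see tight clusters of similar items, focus on local structure, which is what trustworthiness and continuity both measure. If you want to see how clusters relate overall, remember that t-SNE preserves global structure less reliably and that neither score captures it. The perplexity parameter shifts the balance: low values emphasize very local neighborhoods, while higher values take broader neighborhoods into account.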
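One rough way to probe this tradeoff is to sweep perplexity and score each map at a small and a large neighborhood size. The sketch below uses synthetic blobs from `make_blobs`; the perplexity values and the neighborhood sizes 5 and 50 are illustrative assumptions, and trustworthiness at a large `n_neighbors` is only a coarse proxy for broader structure, not a true global metric.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE, trustworthiness

# Synthetic data: 4 Gaussian blobs in 20 dimensions (illustrative only).
X, _ = make_blobs(n_samples=200, n_features=20, centers=4, random_state=0)

scores = {}
for perplexity in (5, 30, 60):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    scores[perplexity] = (
        trustworthiness(X, emb, n_neighbors=5),   # tight neighborhoods
        trustworthiness(X, emb, n_neighbors=50),  # broader neighborhoods
    )

for perplexity, (t_small, t_large) in scores.items():
    print(f"perplexity={perplexity:>2}  T@5={t_small:.3f}  T@50={t_large:.3f}")
```

Which perplexity looks best depends on the dataset, so treat the printed numbers as a comparison within one dataset rather than as absolute quality scores.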

What "good" vs "bad" metric values look like for t-SNE visualization

Good t-SNE visualization:

  • Clear, separate clusters matching known groups.
  • High trustworthiness score (close to 1.0), meaning neighbors in the map are true neighbors in the original data.
  • Reasonable continuity score, meaning original neighbors remain neighbors in the map.
  • Visual patterns that match what you expect from the data.

Bad t-SNE visualization:

  • Clusters overlap heavily or are mixed up.
  • Low trustworthiness (much less than 1.0), meaning many neighbors in the map are not neighbors in the original data.
  • Map looks random or noisy with no clear groups.
  • Visual patterns contradict known labels or data structure.
Common pitfalls when evaluating t-SNE visualizations
  • Overinterpreting distances: Distances between clusters in t-SNE plots are not always meaningful globally.
  • Ignoring randomness: t-SNE uses randomness, so different runs can look different. Run it several times with different random_state values and trust only the patterns that persist.
  • Misleading clusters: t-SNE can create apparent clusters even if none exist in data.
  • Parameter sensitivity: Perplexity and learning rate affect results a lot; poor choices can ruin visualization.
  • Not checking trustworthiness: Without metrics, visual patterns can be misleading.
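The randomness and metric pitfalls above can be checked together by repeating the embedding with several seeds and comparing scores. This is a minimal sketch on synthetic data; the seed list, `perplexity=30`, `n_neighbors=10`, and the use of `init="random"` (chosen here to make run-to-run variation more visible) are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE, trustworthiness

# Synthetic data: 4 Gaussian blobs in 20 dimensions (illustrative only).
X, _ = make_blobs(n_samples=200, n_features=20, centers=4, random_state=0)

seeds = [0, 1, 2]
trust_by_seed = []
for seed in seeds:
    # init="random" so each seed starts from a different layout.
    emb = TSNE(
        n_components=2, perplexity=30, init="random", random_state=seed
    ).fit_transform(X)
    trust_by_seed.append(trustworthiness(X, emb, n_neighbors=10))

for seed, score in zip(seeds, trust_by_seed):
    print(f"seed={seed}  trustworthiness={score:.3f}")
print(f"spread (max - min): {np.ptp(trust_by_seed):.3f}")
```

A small spread suggests the neighborhood structure is stable across runs, even if the maps are rotated or mirrored relative to each other.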
Self-check question

Your t-SNE plot shows three clear clusters matching your labels, but the trustworthiness score is 0.6 (low). Is this visualization reliable? Why or why not?

Answer: No, it is not fully reliable. The low trustworthiness means many points that appear as neighbors in the map are not neighbors in the original data. The clusters might look clear but could be misleading. You should try different parameters (for example, perplexity) and check other metrics before trusting the plot.

Key Result
t-SNE evaluation relies on trustworthiness and continuity scores, alongside visual inspection, to check that local neighborhoods are preserved in the visualization. Neither score measures global structure.

Practice

1. What is the main purpose of using t-SNE in machine learning?
easy
A. To increase the number of features in the dataset
B. To train a predictive model for classification
C. To visualize high-dimensional data in 2D or 3D to find patterns
D. To clean and preprocess data by removing missing values

Solution

  1. Step 1: Understand t-SNE's function

    t-SNE is a tool that reduces many features into 2 or 3 dimensions for easy visualization.
  2. Step 2: Identify its main use

    It helps us see groups or clusters in complex data, not to train models or clean data.
  3. Final Answer:

    To visualize high-dimensional data in 2D or 3D to find patterns -> Option C
  4. Quick Check:

    t-SNE = visualization tool [OK]
Hint: t-SNE = visualize complex data simply [OK]
Common Mistakes:
  • Thinking t-SNE trains prediction models
  • Confusing t-SNE with data cleaning methods
  • Assuming t-SNE increases feature count
2. Which of the following is the correct way to import t-SNE from scikit-learn in Python?
easy
A. from sklearn.manifold import TSNE
B. import tsne from sklearn
C. from sklearn.decomposition import TSNE
D. import TSNE from sklearn.manifold

Solution

  1. Step 1: Recall correct import syntax

    scikit-learn's t-SNE is in the manifold module and imported as TSNE.
  2. Step 2: Check each option

    from sklearn.manifold import TSNE uses correct Python import syntax and correct module. Others have wrong syntax or module.
  3. Final Answer:

    from sklearn.manifold import TSNE -> Option A
  4. Quick Check:

    Correct import = from sklearn.manifold import TSNE [OK]
Hint: t-SNE is in sklearn.manifold, import as TSNE [OK]
Common Mistakes:
  • Using wrong module like sklearn.decomposition
  • Incorrect import syntax causing errors
  • Confusing lowercase and uppercase in TSNE
3. What will be the shape of the output from the following code snippet?
    from sklearn.manifold import TSNE
    import numpy as np
    X = np.random.rand(100, 50)
    tsne = TSNE(n_components=2, random_state=42)
    X_embedded = tsne.fit_transform(X)
    print(X_embedded.shape)
medium
A. (50, 2)
B. (2, 100)
C. (100, 50)
D. (100, 2)

Solution

  1. Step 1: Understand input and t-SNE output

    Input X has 100 samples and 50 features. t-SNE reduces features to 2 dimensions.
  2. Step 2: Determine output shape

    Output shape is (number of samples, n_components) = (100, 2).
  3. Final Answer:

    (100, 2) -> Option D
  4. Quick Check:

    Output shape = (samples, components) [OK]
Hint: Output shape = (samples, n_components) [OK]
Common Mistakes:
  • Confusing features with samples in output shape
  • Swapping rows and columns in shape
  • Assuming output shape matches input shape
4. You run t-SNE on your dataset but get a ValueError: 'perplexity must be less than n_samples'. What is the likely cause and fix?
medium
A. Input data is not scaled; apply normalization
B. Perplexity is set too high; reduce it below number of samples
C. Random state is not set; set random_state parameter
D. Data contains missing values; remove or fill them

Solution

  1. Step 1: Understand the error message

    The error says perplexity must be less than number of samples, so perplexity is too large.
  2. Step 2: Fix by adjusting perplexity

    Reduce perplexity parameter to a value smaller than the number of samples in your data.
  3. Final Answer:

    Perplexity is set too high; reduce it below number of samples -> Option B
  4. Quick Check:

    Perplexity < samples [OK]
Hint: Keep perplexity less than sample count [OK]
Common Mistakes:
  • Ignoring perplexity limits and increasing it
  • Trying to fix by scaling data instead
  • Changing unrelated parameters like random_state
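The fix can be sketched as follows. Recent scikit-learn versions raise this error when perplexity is not below the sample count; older versions may not, so the example catches the error rather than relying on it. The rule `min(30, (n_samples - 1) / 3)` is a heuristic assumed here for illustration, not a scikit-learn requirement, and the name `safe_perplexity` is hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_small = rng.random((20, 5))  # only 20 samples
n_samples = X_small.shape[0]

# perplexity=30 is not below n_samples=20, so recent scikit-learn
# versions are expected to reject it.
try:
    TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_small)
except ValueError as err:
    print("t-SNE refused:", err)

# Heuristic (an assumption): keep perplexity well below the sample count.
safe_perplexity = min(30.0, (n_samples - 1) / 3)
X_small_embedded = TSNE(
    n_components=2, perplexity=safe_perplexity, random_state=0
).fit_transform(X_small)
print(X_small_embedded.shape)
```

With so few samples, any t-SNE map is hard to interpret, so a lower perplexity removes the error but does not by itself make the plot trustworthy.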
5. You have a dataset with 1000 samples and 100 features. You want to visualize it with t-SNE but also keep track of clusters found by KMeans. Which approach is best?
hard
A. Run KMeans first, then apply t-SNE on original data, color points by cluster
B. Apply t-SNE first, then run KMeans on the 2D t-SNE output
C. Use t-SNE only, no clustering needed for visualization
D. Run KMeans on original data and use PCA instead of t-SNE

Solution

  1. Step 1: Understand the goal

    You want to visualize data and show meaningful clusters clearly on the 2D plot.
  2. Step 2: Choose correct order

    Running KMeans first on the high-dimensional data finds clusters based on the original distances; t-SNE then visualizes them when you color points by cluster label.
  3. Step 3: Why not other options?

    Clustering on t-SNE output (B) is suboptimal as t-SNE distorts distances and is for visualization only, not modeling.
  4. Final Answer:

    Run KMeans first, then apply t-SNE on original data, color points by cluster -> Option A
  5. Quick Check:

    Cluster high-dim first, visualize after [OK]
Hint: Cluster original data first, then t-SNE visualize [OK]
Common Mistakes:
  • Clustering t-SNE output causing distorted clusters
  • Skipping clustering and missing group info
  • Using PCA instead of t-SNE unnecessarily
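The approach in Option A might look like the sketch below. It uses synthetic data from `make_blobs` with 300 samples instead of 1000 to keep the run short, and `n_clusters=4` is an assumption that matches how the synthetic data was generated; with real data the number of clusters has to be chosen separately.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for the dataset: 300 samples, 100 features.
X, _ = make_blobs(n_samples=300, n_features=100, centers=4, random_state=0)

# Step 1: cluster in the original 100-dimensional space.
cluster_ids = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Step 2: embed the same original data in 2D, for visualization only.
X_embedded = TSNE(n_components=2, random_state=0).fit_transform(X)

# Step 3: color each 2D point by its KMeans cluster, for example with
# matplotlib: plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=cluster_ids)
print(X_embedded.shape, cluster_ids.shape)
```

Because both steps start from the same `X`, row `i` of `X_embedded` and entry `i` of `cluster_ids` refer to the same sample.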