
Visualizing embeddings (t-SNE) in NLP - ML Experiment: Train & Evaluate

Experiment - Visualizing embeddings (t-SNE)
Problem: You have word embeddings from a small text dataset. You want to visualize these embeddings to understand how similar words group together.
Current Metrics: No quantitative metrics; current state is raw embeddings without visualization.
Issue: Without visualization, it's hard to see relationships or clusters in the embeddings.
Your Task
Create a 2D visualization of word embeddings using t-SNE to reveal clusters of similar words.
Use t-SNE for dimensionality reduction.
Use matplotlib for plotting.
Use a small set of word embeddings (e.g., from pretrained GloVe or a small custom set).
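For the pretrained option, GloVe vectors are typically distributed as plain text, one word per line followed by its vector components separated by spaces. The sketch below parses lines in that layout; `parse_glove_lines`, the `vocab` filter, and the three sample lines are hypothetical and only illustrate the idea.

```python
import numpy as np


def parse_glove_lines(lines, vocab=None):
    """Parse GloVe-style lines ("word v1 v2 ... vN") into a word list and a matrix.

    `vocab` is an optional set of words to keep; all others are skipped.
    """
    words, vectors = [], []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        if vocab is not None and word not in vocab:
            continue
        words.append(word)
        vectors.append([float(x) for x in parts[1:]])
    matrix = np.array(vectors, dtype=np.float32) if vectors else np.empty((0, 0), dtype=np.float32)
    return words, matrix


# Made-up 3-dimensional lines standing in for a real GloVe file
sample_lines = [
    "cat 0.1 0.2 0.3",
    "dog 0.2 0.1 0.4",
    "car -0.5 0.9 0.0",
]
words, embeddings = parse_glove_lines(sample_lines, vocab={"cat", "dog"})
```

With a downloaded file, the same function could be fed an open file handle, since iterating over a text file yields its lines.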
Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Sample word embeddings (5 words, simulated 50-dimensional vectors)
np.random.seed(0)  # fix the seed so the simulated vectors are reproducible
words = ['cat', 'dog', 'apple', 'orange', 'car']
embeddings = np.array([
    np.random.normal(0, 1, 50) + 1,  # cat
    np.random.normal(0, 1, 50) + 1,  # dog
    np.random.normal(0, 1, 50) - 1,  # apple
    np.random.normal(0, 1, 50) - 1,  # orange
    np.random.normal(0, 1, 50) + 3   # car
])

# Normalize embeddings
embeddings_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

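# Apply t-SNE (perplexity must be less than the number of samples;
# the default of 30 raises a ValueError with only 5 words)
tsne = TSNE(n_components=2, perplexity=2, random_state=42)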
embeddings_2d = tsne.fit_transform(embeddings_norm)

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], color='blue')
for i, word in enumerate(words):
    plt.text(embeddings_2d[i, 0] + 0.01, embeddings_2d[i, 1] + 0.01, word, fontsize=12)
plt.title('t-SNE visualization of word embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.grid(True)
plt.show()
Created a small set of sample word embeddings with simulated vectors.
Normalized embeddings to unit length so that the Euclidean distances t-SNE uses reflect cosine similarity rather than vector magnitude.
Applied t-SNE to reduce 50D embeddings to 2D.
Plotted the 2D points with word labels using matplotlib.
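One reason the normalization step matters: scikit-learn's t-SNE measures Euclidean distance by default, and for unit-length vectors the squared Euclidean distance equals 2 - 2 * cos(a, b). The short check below, on two random vectors chosen purely for illustration, confirms that identity numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 50))  # two arbitrary 50-dimensional vectors

# Scale each vector to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = float(a_unit @ b_unit)                  # cosine similarity
sq_dist = float(np.sum((a_unit - b_unit) ** 2))  # squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = bool(np.isclose(sq_dist, 2 - 2 * cosine))
```

After normalization, words that point in similar directions end up close together regardless of how long their original vectors were.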
Results Interpretation

Before: Raw 50-dimensional embeddings with no visual insight.
After: 2D plot showing word clusters, making relationships visible.

t-SNE helps us see how similar words group together by reducing high-dimensional embeddings to 2D, making patterns easier to understand.
Bonus Experiment
Try visualizing embeddings from a larger set of words and compare t-SNE with PCA for dimensionality reduction.
💡 Hint
Use sklearn's PCA and t-SNE on the same embeddings and plot side-by-side to see differences in cluster separation.
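A possible starting point for this bonus experiment is sketched below. The three simulated word groups, the helper names, and the perplexity of 10 are all assumptions made for illustration; real embeddings would replace `make_toy_embeddings`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def make_toy_embeddings(n_per_group=20, dim=50, seed=0):
    """Simulate three word groups as Gaussian blobs (illustrative data only)."""
    rng = np.random.default_rng(seed)
    centers = [1.0, -1.0, 3.0]
    vectors = np.vstack([rng.normal(c, 1.0, size=(n_per_group, dim)) for c in centers])
    labels = np.repeat(np.arange(len(centers)), n_per_group)
    return vectors, labels


def project_both(vectors, perplexity=10, seed=42):
    """Reduce the same vectors to 2D with PCA and with t-SNE."""
    pca_2d = PCA(n_components=2, random_state=seed).fit_transform(vectors)
    tsne_2d = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(vectors)
    return pca_2d, tsne_2d


def plot_side_by_side(pca_2d, tsne_2d, labels):
    """Draw the two projections next to each other, colored by group."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    for ax, points, title in zip(axes, (pca_2d, tsne_2d), ("PCA", "t-SNE")):
        ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10")
        ax.set_title(title)
        ax.grid(True)
    return fig


vectors, labels = make_toy_embeddings()
pca_2d, tsne_2d = project_both(vectors)
```

Calling `plot_side_by_side(pca_2d, tsne_2d, labels)` followed by `plt.show()` displays both panels. PCA is a linear projection, while t-SNE is nonlinear and focuses on preserving local neighborhoods, so the two panels can look quite different on the same data.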

Practice

1. What is the main purpose of using t-SNE in visualizing word embeddings?
easy
A. To train word embeddings from raw text data
B. To increase the size of word embeddings for better accuracy
C. To reduce high-dimensional word vectors into 2D or 3D for easy visualization
D. To cluster words based on their frequency in the text

Solution

  1. Step 1: Understand t-SNE's role in dimensionality reduction

    t-SNE reduces complex, high-dimensional data like word embeddings into 2D or 3D space for visualization.
  2. Step 2: Differentiate from other tasks

    It does not train embeddings or cluster by frequency but helps visualize similarity by reducing dimensions.
  3. Final Answer:

    To reduce high-dimensional word vectors into 2D or 3D for easy visualization -> Option C
  4. Quick Check:

    t-SNE = dimensionality reduction for visualization [OK]
Hint: t-SNE = reduce dimensions to visualize complex data [OK]
Common Mistakes:
  • Confusing t-SNE with training embeddings
  • Thinking t-SNE increases data size
  • Assuming t-SNE clusters by word frequency
2. Which of the following is the correct way to import t-SNE from scikit-learn in Python?
easy
A. from sklearn.manifold import TSNE
B. import sklearn.tsne as TSNE
C. from sklearn.embedding import tSNE
D. import tsne from sklearn

Solution

  1. Step 1: Recall correct module for t-SNE in scikit-learn

    t-SNE is in the sklearn.manifold module and is imported as TSNE.
  2. Step 2: Check syntax correctness

    from sklearn.manifold import TSNE is valid Python and names the correct module and class. The other options use nonexistent modules, the wrong capitalization, or invalid import syntax.
  3. Final Answer:

    from sklearn.manifold import TSNE -> Option A
  4. Quick Check:

    Correct import = from sklearn.manifold import TSNE [OK]
Hint: t-SNE is in sklearn.manifold, import as TSNE [OK]
Common Mistakes:
  • Using wrong module like sklearn.embedding
  • Incorrect import syntax
  • Confusing lowercase and uppercase in import
3. Given this Python code snippet using t-SNE, what will be the shape of embeddings_2d?
from sklearn.manifold import TSNE
import numpy as np

embeddings = np.random.rand(100, 50)  # 100 words, 50 dimensions
model = TSNE(n_components=2, random_state=42)
embeddings_2d = model.fit_transform(embeddings)
medium
A. (100, 2)
B. (2, 100)
C. (50, 2)
D. (100, 50)

Solution

  1. Step 1: Understand input shape and t-SNE output

    Input embeddings have shape (100, 50) meaning 100 samples with 50 features each.
  2. Step 2: Check t-SNE output shape with n_components=2

    t-SNE reduces features to 2 dimensions, so output shape is (100, 2) -- 100 samples, 2 features.
  3. Final Answer:

    (100, 2) -> Option A
  4. Quick Check:

    Output shape = (samples, n_components) = (100, 2) [OK]
Hint: Output shape = (samples, n_components) in t-SNE [OK]
Common Mistakes:
  • Confusing rows and columns in output shape
  • Assuming output shape equals input shape
  • Mixing up n_components with sample count
4. You run t-SNE on word embeddings but get a ValueError: "perplexity must be less than n_samples". What is the likely cause and fix?
medium
A. Input embeddings have wrong shape; reshape to (features, samples)
B. Perplexity is set too high; reduce it below the number of samples
C. Random state is missing; add random_state parameter
D. t-SNE requires normalized data; normalize embeddings first

Solution

  1. Step 1: Understand perplexity parameter in t-SNE

    Perplexity controls neighborhood size and must be less than the number of samples.
  2. Step 2: Identify cause of ValueError

    The error means perplexity is equal to or larger than the sample count, which is invalid.
  3. Step 3: Fix by lowering perplexity

    Reduce perplexity to a value smaller than the number of samples to fix the error.
  4. Final Answer:

    Perplexity is set too high; reduce it below the number of samples -> Option B
  5. Quick Check:

    Perplexity < n_samples to avoid error [OK]
Hint: Keep perplexity less than sample count in t-SNE [OK]
Common Mistakes:
  • Changing input shape instead of perplexity
  • Ignoring the perplexity limit
  • Assuming normalization fixes this error
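The fix can be sketched in code as follows. `safe_perplexity` is a hypothetical helper, and capping perplexity at roughly a third of the sample count is a heuristic assumed here, not a scikit-learn rule; the only hard requirement is that perplexity stays below the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE


def safe_perplexity(n_samples, preferred=30.0):
    """Return a perplexity strictly below n_samples (hypothetical helper)."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    # Heuristic cap: about one third of the available neighbors, at least 1
    return float(max(1.0, min(preferred, (n_samples - 1) / 3)))


# Five simulated 50-dimensional embeddings, too few for the default perplexity of 30
embeddings = np.random.default_rng(0).normal(size=(5, 50))
perplexity = safe_perplexity(len(embeddings))

tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings)
```

For larger datasets the helper simply returns the preferred value, so the same call works regardless of how many words are plotted.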
5. You want to visualize embeddings of 5000 words using t-SNE but notice the plot is very crowded and unclear. Which approach best improves visualization clarity?
hard
A. Apply t-SNE with n_components=50 to keep more dimensions
B. Increase perplexity to a very high value like 1000 to spread points out
C. Use raw high-dimensional embeddings without dimensionality reduction
D. Reduce the number of words by selecting a smaller subset before applying t-SNE

Solution

  1. Step 1: Understand t-SNE limitations with large datasets

    t-SNE works best with small to medium data; large sets cause crowded plots and slow computation.
  2. Step 2: Choose practical solution for clarity

    Reducing the dataset size by selecting fewer words improves plot clarity and speed.
  3. Step 3: Evaluate other options

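    A very high perplexity makes each point consider far more neighbors, which tends to blur local clusters rather than spread points out. Setting n_components=50 leaves nothing that can be drawn on a 2D plot, and raw high-dimensional embeddings cannot be plotted directly either.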
  4. Final Answer:

    Reduce the number of words by selecting a smaller subset before applying t-SNE -> Option D
  5. Quick Check:

    Smaller data = clearer t-SNE plots [OK]
Hint: Use smaller data subsets for clearer t-SNE plots [OK]
Common Mistakes:
  • Setting perplexity too high
  • Using too many dimensions in t-SNE
  • Trying to visualize raw embeddings directly
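A minimal sketch of the subset approach is shown below, assuming a random sample is acceptable; picking the most frequent words or a hand-chosen list would work just as well. The helper `sample_subset`, the simulated 5000-word vocabulary, and the subset size of 200 are all illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE


def sample_subset(words, embeddings, max_words=200, seed=0):
    """Randomly keep at most max_words words and their embedding rows."""
    rng = np.random.default_rng(seed)
    n = len(words)
    if n <= max_words:
        idx = np.arange(n)
    else:
        idx = np.sort(rng.choice(n, size=max_words, replace=False))
    return [words[i] for i in idx], embeddings[idx]


# Stand-in data: 5000 simulated words with 50-dimensional vectors
rng = np.random.default_rng(1)
all_words = [f"word{i}" for i in range(5000)]
all_embeddings = rng.normal(size=(5000, 50))

subset_words, subset_embeddings = sample_subset(all_words, all_embeddings, max_words=200)

# Run t-SNE only on the subset, which keeps the plot readable and the run fast
subset_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(subset_embeddings)
```

The resulting `subset_2d` has one row per retained word and can be plotted with the same scatter-and-label loop used in the solution above.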