Bird
Raised Fist0
NLPml~5 mins

Visualizing embeddings (t-SNE) in NLP

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Introduction

We use t-SNE to turn complex word or sentence numbers into pictures. This helps us see how similar or different words are in a simple way.

You want to see how words group together by meaning.
You have many word numbers and want to show them in 2D or 3D.
You want to check if your word model learned good relationships.
You want to explain word similarities to friends or teammates.
Syntax
NLP
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings)

n_components sets the output dimension (2D or 3D).

perplexity balances attention between local and global data structure.

Examples
Basic 2D t-SNE visualization of embeddings.
NLP
tsne = TSNE(n_components=2)
embeddings_2d = tsne.fit_transform(embeddings)
3D t-SNE with higher perplexity for more global structure.
NLP
tsne = TSNE(n_components=3, perplexity=40)
embeddings_3d = tsne.fit_transform(embeddings)
Sample Model

This code shows how to use t-SNE to turn 4D word embeddings into 2D points. It prints the new 2D points and draws a simple plot with word labels.

NLP
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Sample word embeddings for 5 words (random for demo)
embeddings = np.array([
    [0.1, 0.3, 0.5, 0.7],  # word1
    [0.2, 0.1, 0.4, 0.6],  # word2
    [0.9, 0.8, 0.7, 0.6],  # word3
    [0.85, 0.75, 0.65, 0.55],  # word4
    [0.15, 0.25, 0.35, 0.45]   # word5
])

# Create t-SNE model
tsne = TSNE(n_components=2, random_state=0)

# Transform embeddings to 2D
embeddings_2d = tsne.fit_transform(embeddings)

# Print 2D embeddings
print('2D embeddings:')
print(embeddings_2d)

# Plot the 2D embeddings
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])
for i, txt in enumerate(['word1', 'word2', 'word3', 'word4', 'word5']):
    plt.annotate(txt, (embeddings_2d[i, 0], embeddings_2d[i, 1]))
plt.title('t-SNE visualization of word embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.grid(True)
plt.show()
OutputSuccess
Important Notes

t-SNE is slow for very large data sets; use a smaller sample or other methods for big data.

Results can change each run unless you set random_state for repeatability.

t-SNE shows relative distances, not exact numbers, so use it to explore patterns, not exact values.

Summary

t-SNE helps turn complex word numbers into easy-to-see pictures.

You can use it to check if words with similar meanings group together.

It works best with small to medium data and needs some tuning like perplexity.

Practice

(1/5)
1. What is the main purpose of using t-SNE in visualizing word embeddings?
easy
A. To train word embeddings from raw text data
B. To increase the size of word embeddings for better accuracy
C. To reduce high-dimensional word vectors into 2D or 3D for easy visualization
D. To cluster words based on their frequency in the text

Solution

  1. Step 1: Understand t-SNE's role in dimensionality reduction

    t-SNE reduces complex, high-dimensional data like word embeddings into 2D or 3D space for visualization.
  2. Step 2: Differentiate from other tasks

    It does not train embeddings or cluster by frequency but helps visualize similarity by reducing dimensions.
  3. Final Answer:

    To reduce high-dimensional word vectors into 2D or 3D for easy visualization -> Option C
  4. Quick Check:

    t-SNE = dimensionality reduction for visualization [OK]
Hint: t-SNE = reduce dimensions to visualize complex data [OK]
Common Mistakes:
  • Confusing t-SNE with training embeddings
  • Thinking t-SNE increases data size
  • Assuming t-SNE clusters by word frequency
2. Which of the following is the correct way to import t-SNE from scikit-learn in Python?
easy
A. from sklearn.manifold import TSNE
B. import sklearn.tsne as TSNE
C. from sklearn.embedding import tSNE
D. import tsne from sklearn

Solution

  1. Step 1: Recall correct module for t-SNE in scikit-learn

    t-SNE is in the sklearn.manifold module and is imported as TSNE.
  2. Step 2: Check syntax correctness

    from sklearn.manifold import TSNE uses correct syntax: from sklearn.manifold import TSNE. Others are invalid imports.
  3. Final Answer:

    from sklearn.manifold import TSNE -> Option A
  4. Quick Check:

    Correct import = from sklearn.manifold import TSNE [OK]
Hint: t-SNE is in sklearn.manifold, import as TSNE [OK]
Common Mistakes:
  • Using wrong module like sklearn.embedding
  • Incorrect import syntax
  • Confusing lowercase and uppercase in import
3. Given this Python code snippet using t-SNE, what will be the shape of embeddings_2d?
from sklearn.manifold import TSNE
import numpy as np

embeddings = np.random.rand(100, 50)  # 100 words, 50 dimensions
model = TSNE(n_components=2, random_state=42)
embeddings_2d = model.fit_transform(embeddings)
medium
A. (100, 2)
B. (2, 100)
C. (50, 2)
D. (100, 50)

Solution

  1. Step 1: Understand input shape and t-SNE output

    Input embeddings have shape (100, 50) meaning 100 samples with 50 features each.
  2. Step 2: Check t-SNE output shape with n_components=2

    t-SNE reduces features to 2 dimensions, so output shape is (100, 2) -- 100 samples, 2 features.
  3. Final Answer:

    (100, 2) -> Option A
  4. Quick Check:

    Output shape = (samples, n_components) = (100, 2) [OK]
Hint: Output shape = (samples, n_components) in t-SNE [OK]
Common Mistakes:
  • Confusing rows and columns in output shape
  • Assuming output shape equals input shape
  • Mixing up n_components with sample count
4. You run t-SNE on word embeddings but get a ValueError: "perplexity must be less than n_samples". What is the likely cause and fix?
medium
A. Input embeddings have wrong shape; reshape to (features, samples)
B. Perplexity is set too high; reduce it below the number of samples
C. Random state is missing; add random_state parameter
D. t-SNE requires normalized data; normalize embeddings first

Solution

  1. Step 1: Understand perplexity parameter in t-SNE

    Perplexity controls neighborhood size and must be less than the number of samples.
  2. Step 2: Identify cause of ValueError

    Error means perplexity is set equal or larger than sample count, which is invalid.
  3. Step 3: Fix by lowering perplexity

    Reduce perplexity to a value smaller than the number of samples to fix the error.
  4. Final Answer:

    Perplexity is set too high; reduce it below the number of samples -> Option B
  5. Quick Check:

    Perplexity < n_samples to avoid error [OK]
Hint: Keep perplexity less than sample count in t-SNE [OK]
Common Mistakes:
  • Changing input shape instead of perplexity
  • Ignoring the perplexity limit
  • Assuming normalization fixes this error
5. You want to visualize embeddings of 5000 words using t-SNE but notice the plot is very crowded and unclear. Which approach best improves visualization clarity?
hard
A. Apply t-SNE with n_components=50 to keep more dimensions
B. Increase perplexity to a very high value like 1000 to spread points out
C. Use raw high-dimensional embeddings without dimensionality reduction
D. Reduce the number of words by selecting a smaller subset before applying t-SNE

Solution

  1. Step 1: Understand t-SNE limitations with large datasets

    t-SNE works best with small to medium data; large sets cause crowded plots and slow computation.
  2. Step 2: Choose practical solution for clarity

    Reducing the dataset size by selecting fewer words improves plot clarity and speed.
  3. Step 3: Evaluate other options

    Increasing perplexity too high or keeping many dimensions defeats t-SNE's purpose; raw embeddings are hard to visualize.
  4. Final Answer:

    Reduce the number of words by selecting a smaller subset before applying t-SNE -> Option D
  5. Quick Check:

    Smaller data = clearer t-SNE plots [OK]
Hint: Use smaller data subsets for clearer t-SNE plots [OK]
Common Mistakes:
  • Setting perplexity too high
  • Using too many dimensions in t-SNE
  • Trying to visualize raw embeddings directly