Embeddings turn words into numeric vectors so that a model can work with their meaning. Words with similar meanings end up close together in the vector space, which reflects how their ideas are related.
Why embeddings capture semantic meaning in NLP
embedding = Embedding(input_dim, output_dim)
vector = embedding(word_index)
input_dim is the size of your vocabulary (number of unique words).
output_dim is the size of the vector that represents each word.
embedding = Embedding(10000, 50)
vector = embedding(42)

embedding = Embedding(5000, 100)
vector = embedding(7)
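Conceptually, an embedding layer behaves like a lookup table: a matrix with one row per vocabulary word, where indexing by a word's integer ID returns that word's vector. The sketch below imitates this with plain NumPy. The random initial values and the helper name `lookup` are assumptions made for illustration, not the API of any particular library.

```python
import numpy as np

input_dim = 10000   # vocabulary size (number of unique words)
output_dim = 50     # length of the vector that represents each word

# Untrained embeddings typically start as small random values that
# training later adjusts. The seed is fixed only for reproducibility.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(scale=0.1, size=(input_dim, output_dim))

def lookup(word_index):
    """Return the row stored for word_index (hypothetical helper)."""
    return embedding_matrix[word_index]

vector = lookup(42)
print(vector.shape)  # (50,)
```

Each word index selects exactly one row, so the result always has `output_dim` entries regardless of the vocabulary size.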
This code shows how embeddings represent words as vectors and calculates the similarity between related words. The vectors are hand-picked for illustration: 'cat' and 'dog' are both animals, so their vectors are set close together, and the same goes for the fruits 'apple' and 'orange'.

import numpy as np

# Simple example of word embeddings using small hand-picked vectors
vocab = ['cat', 'dog', 'apple', 'orange']

# Assign an illustrative 3D vector to each word
embeddings = {
    'cat': np.array([0.9, 0.1, 0.3]),
    'dog': np.array([0.8, 0.2, 0.4]),
    'apple': np.array([0.1, 0.9, 0.7]),
    'orange': np.array([0.2, 0.8, 0.6])
}

# Function to find similarity (cosine similarity)
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Compare similarity between 'cat' and 'dog'
sim_cat_dog = cosine_similarity(embeddings['cat'], embeddings['dog'])

# Compare similarity between 'apple' and 'orange'
sim_apple_orange = cosine_similarity(embeddings['apple'], embeddings['orange'])

print(f"Similarity between 'cat' and 'dog': {sim_cat_dog:.2f}")
print(f"Similarity between 'apple' and 'orange': {sim_apple_orange:.2f}")
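Within-category scores are easier to interpret next to a cross-category baseline. As a small extension of the example above, the sketch below reuses the same illustrative vectors and also compares 'cat' with 'apple'. The values are toy numbers, not trained embeddings.

```python
import numpy as np

# Same illustrative 3D vectors as in the example above (not trained values)
embeddings = {
    'cat': np.array([0.9, 0.1, 0.3]),
    'dog': np.array([0.8, 0.2, 0.4]),
    'apple': np.array([0.1, 0.9, 0.7]),
    'orange': np.array([0.2, 0.8, 0.6]),
}

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Same category: both animals
sim_cat_dog = cosine_similarity(embeddings['cat'], embeddings['dog'])

# Different categories: an animal and a fruit
sim_cat_apple = cosine_similarity(embeddings['cat'], embeddings['apple'])

print(f"cat vs dog:   {sim_cat_dog:.2f}")
print(f"cat vs apple: {sim_cat_apple:.2f}")
```

With these values the animal pair scores close to 1 while the animal and fruit pair scores much lower, which is the contrast the toy vectors are meant to show.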
Embeddings capture meaning because similar words appear in similar contexts, so their vectors become close.
Training embeddings on lots of text helps the model learn these relationships automatically.
Cosine similarity is a common way to measure how close two word vectors are.
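To see why shared contexts lead to nearby vectors, one can count which words appear near each target word in a tiny corpus and compare the counts. This is a simplified count-based sketch, not the way trained embeddings are actually learned. The corpus, the stopword set, and the helper `context_vector` are all made up for illustration.

```python
from collections import Counter
import numpy as np

# Tiny made-up corpus: 'cat' and 'dog' share contexts, 'apple' does not
corpus = [
    "the cat chased the mouse",
    "the dog chased the ball",
    "the cat ate the food",
    "the dog ate the food",
    "she peeled the apple",
    "she peeled the orange",
]
stopwords = {"the"}  # ignored so a very common word does not dominate

sentences = [s.split() for s in corpus]
vocab = sorted({w for sent in sentences for w in sent})
index = {w: i for i, w in enumerate(vocab)}

def context_vector(target, window=3):
    """Count non-stopword neighbours within `window` positions of target."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            if word != target:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] not in stopwords:
                    counts[sent[j]] += 1
    vec = np.zeros(len(vocab))
    for neighbour, count in counts.items():
        vec[index[neighbour]] = count
    return vec

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat, dog, apple = (context_vector(w) for w in ("cat", "dog", "apple"))
sim_cat_dog = cosine_similarity(cat, dog)
sim_cat_apple = cosine_similarity(cat, apple)
print(f"cat vs dog:   {sim_cat_dog:.2f}")
print(f"cat vs apple: {sim_cat_apple:.2f}")
```

Because 'cat' and 'dog' occur next to the same verbs and objects, their count vectors overlap heavily, while 'cat' and 'apple' share no neighbours in this corpus.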
Embeddings turn words into numbers that show their meaning.
Words with similar meanings have vectors close together.
This lets models compare and relate words by meaning rather than by spelling alone.
Practice
Solution
Step 1: Understand what embeddings do
Embeddings convert words into numbers (vectors) that represent their meanings.
Step 2: Recognize the benefit for computers
These numbers help computers see which words are similar in meaning by their closeness in vector space.
Final Answer:
Because they turn words into numbers that show their meaning -> Option A
Quick Check:
Embeddings = numeric meaning representation [OK]
- Thinking embeddings translate languages
- Confusing embeddings with word frequency counts
- Believing embeddings remove words
Solution
Step 1: Identify the data type for embeddings
Embeddings are numeric vectors, usually lists or arrays of floats.
Step 2: Check each option's format
embedding = [0.1, 0.5, -0.3] shows a list of numbers, which is correct. Others are strings, integers, or dictionaries, which are incorrect.
Final Answer:
embedding = [0.1, 0.5, -0.3] -> Option D
Quick Check:
Embedding vector = list of numbers [OK]
- Using strings instead of numeric vectors
- Using single numbers instead of vectors
- Using dictionaries instead of lists
embedding_cat = [0.2, 0.4, 0.6]
embedding_dog = [0.21, 0.39, 0.58]
embedding_car = [0.9, 0.1, 0.2]

Which pair is most semantically similar based on cosine similarity?
Solution
Step 1: Understand cosine similarity
Cosine similarity measures how closely two vectors point in the same direction; a higher value means the vectors are more similar.
Step 2: Compare vectors
embedding_cat and embedding_dog have nearly identical components, so they point in almost the same direction and their cosine similarity is high. embedding_car points in a quite different direction.
Final Answer:
cat and dog -> Option C
Quick Check:
Closest vectors = most similar words [OK]
- Assuming car is similar to cat or dog
- Thinking all pairs have same similarity
- Ignoring vector closeness
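The answer can be checked numerically rather than by eye. The following sketch computes cosine similarity for every pair of the three vectors from the question using only the standard library. The dictionary layout and the name `best_pair` are choices made for this illustration.

```python
import math
from itertools import combinations

# Vectors taken from the question
vectors = {
    "cat": [0.2, 0.4, 0.6],
    "dog": [0.21, 0.39, 0.58],
    "car": [0.9, 0.1, 0.2],
}

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Score every unordered pair of words
scores = {
    (w1, w2): cosine_similarity(vectors[w1], vectors[w2])
    for w1, w2 in combinations(vectors, 2)
}
for pair, score in scores.items():
    print(pair, round(score, 3))

best_pair = max(scores, key=scores.get)
print("Most similar:", best_pair)
```

The cat and dog pair comes out well ahead of either pair involving car, matching the answer above.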
def similarity(vec1, vec2):
    return sum(a*b for a, b in zip(vec1, vec2))

embedding1 = [0.3, 0.5, 0.2]
embedding2 = [0.3, 0.5]
print(similarity(embedding1, embedding2))

What is the main problem here?
Solution
Step 1: Check vector lengths
embedding1 has 3 elements, embedding2 has 2 elements, so zip stops early, ignoring last element of embedding1.
Step 2: Understand impact on similarity
This causes incomplete calculation and inaccurate similarity score.
Final Answer:
The vectors have different lengths causing incorrect similarity -> Option A
Quick Check:
Vector length mismatch = wrong similarity [OK]
- Ignoring vector length mismatch
- Thinking sum is wrong operation here
- Expecting list output instead of number
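One way to guard against this bug is to check the lengths before computing anything, so a mismatch fails loudly instead of silently dropping elements. The sketch below is one possible fix, not the only one; the error message wording is an arbitrary choice.

```python
def similarity(vec1, vec2):
    """Dot product of two equal-length vectors."""
    if len(vec1) != len(vec2):
        raise ValueError(
            f"Vector lengths differ: {len(vec1)} vs {len(vec2)}"
        )
    return sum(a * b for a, b in zip(vec1, vec2))

embedding1 = [0.3, 0.5, 0.2]
embedding2 = [0.3, 0.5]

try:
    similarity(embedding1, embedding2)
except ValueError as err:
    # The mismatch is now reported instead of producing a misleading score
    message = str(err)
    print("Error:", message)
```

On Python 3.10 and later, `zip(vec1, vec2, strict=True)` gives similar protection by raising `ValueError` when the inputs have different lengths.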
Solution
Step 1: Understand sentence embedding from word embeddings
Averaging pretrained word embeddings creates a vector representing the whole sentence's meaning.
Step 2: Compare other options
One-hot encoding loses semantic info, random vectors have no meaning, and using only first word misses context.
Final Answer:
Use pretrained word embeddings and average their vectors for the whole sentence -> Option B
Quick Check:
Average pretrained embeddings = better sentence meaning [OK]
- Using one-hot encoding which lacks meaning
- Using random vectors without training
- Ignoring all words except the first
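The averaging approach from the solution can be sketched in a few lines. The `pretrained` dictionary below is a hypothetical stand-in with made-up 3D values; in practice the vectors would be loaded from a real pretrained model and have many more dimensions. The function name `sentence_embedding` is also an assumption for this example.

```python
import numpy as np

# Hypothetical stand-in for pretrained word vectors (made-up values)
pretrained = {
    "cats": np.array([0.9, 0.1, 0.3]),
    "chase": np.array([0.5, 0.5, 0.1]),
    "mice": np.array([0.8, 0.2, 0.4]),
}

def sentence_embedding(sentence, vectors):
    """Average the vectors of all words in the sentence that have one."""
    words = sentence.lower().split()
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("No word in the sentence has a known embedding")
    return np.mean(known, axis=0)

sentence_vector = sentence_embedding("Cats chase mice", pretrained)
print(sentence_vector.shape)  # (3,)
```

Skipping unknown words is one simple policy for out-of-vocabulary tokens. Averaging also ignores word order, so "cats chase mice" and "mice chase cats" receive the same vector.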
