Word similarity and analogies let a program measure how words relate to each other, for example how 'king' relates to 'queen'. Both work on word vectors, where words with related meanings sit close together.
Word similarity and analogies in NLP
from gensim.models import KeyedVectors

# Load pre-trained word vectors
model = KeyedVectors.load_word2vec_format('path/to/word2vec.bin', binary=True)

# Find similarity between two words
similarity = model.similarity('word1', 'word2')

# Find words similar to a given word
similar_words = model.most_similar('word', topn=5)

# Solve analogy: word_a is to word_b as word_c is to ?
result = model.most_similar(positive=['word_b', 'word_c'], negative=['word_a'], topn=1)
You need pre-trained word vectors like Word2Vec or GloVe to use these methods.
The similarity method returns a cosine similarity score between -1 and 1 showing how close two words are.
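To make the score range concrete, here is a minimal sketch of cosine similarity in plain Python. The helper name cosine_similarity and the two-dimensional toy vectors are illustrative assumptions, chosen by hand rather than taken from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-D vectors chosen by hand, not taken from a real model
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])        # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # opposite directions

print(round(same_direction, 2), round(unrelated, 2), round(opposite, 2))
```

Vectors pointing the same way score near 1, perpendicular vectors score 0, and vectors pointing in opposite directions score near -1.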
similarity = model.similarity('cat', 'dog')
print(similarity)

similar_words = model.most_similar('king', topn=3)
print(similar_words)

result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
This program loads a small word vector model, calculates similarity between 'cat' and 'dog', finds words similar to 'king', and solves a simple analogy.
import gensim.downloader as api

# Load a small pre-trained model for demonstration
# Here we use a small model from gensim-data for quick testing
model = api.load('glove-wiki-gigaword-50')

# Calculate similarity between 'cat' and 'dog'
similarity = model.similarity('cat', 'dog')
print(f"Similarity between 'cat' and 'dog': {similarity:.2f}")

# Find top 3 words similar to 'king'
similar_words = model.most_similar('king', topn=3)
print("Top 3 words similar to 'king':")
for word, score in similar_words:
    print(f"{word}: {score:.2f}")

# Solve analogy: man is to king as woman is to ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(f"'man' is to 'king' as 'woman' is to '{result[0][0]}' with score {result[0][1]:.2f}")
Pre-trained models can be large; using smaller ones helps beginners experiment quickly.
Not all words will be in the model vocabulary; check with 'word in model' before using.
Similarity scores closer to 1 mean very similar; closer to 0 or negative means less related.
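The vocabulary check from the notes above can be wrapped in a small helper. This is a sketch under assumptions: safe_similarity is a hypothetical name, and TinyVectors is a stand-in class with made-up vectors so the example runs without downloading a model. It relies only on 'word in model' and model.similarity(...), both of which gensim's KeyedVectors also supports.

```python
import math

class TinyVectors:
    """Stand-in for a word-vector model (hypothetical; the vectors are made up)."""

    def __init__(self, vectors):
        self._vectors = vectors

    def __contains__(self, word):
        return word in self._vectors

    def similarity(self, w1, w2):
        # Cosine similarity between the two stored vectors
        u, v = self._vectors[w1], self._vectors[w2]
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

def safe_similarity(model, w1, w2):
    """Return the similarity, or None if either word is out of vocabulary."""
    if w1 not in model or w2 not in model:
        return None
    return model.similarity(w1, w2)

model = TinyVectors({"cat": [0.9, 0.1], "dog": [0.8, 0.3]})
print(safe_similarity(model, "cat", "dog"))      # a float close to 1
print(safe_similarity(model, "cat", "axolotl"))  # None: not in the toy vocabulary
```

Returning None keeps the caller in control of how to handle missing words; raising an exception would be an equally reasonable design.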
Word similarity measures how close two words are in meaning using numbers.
Analogies let us find a word that fits a relationship between other words.
Pre-trained word vectors are needed to do these tasks easily.
Practice
Solution
Step 1: Understand the concept of word similarity
Word similarity measures how close two words are in meaning, often represented by a number like cosine similarity.
Step 2: Differentiate from other word properties
Frequency or letter count does not capture meaning closeness, so those options are incorrect.
Final Answer:
How close two words are in meaning using numbers -> Option A
Quick Check:
Word similarity = meaning closeness [OK]
- Confusing similarity with word frequency
- Thinking similarity is about word length
- Assuming similarity counts shared letters
Which expression computes the cosine similarity between vec1 and vec2 in Python using NumPy?
Solution
Step 1: Recall cosine similarity formula
Cosine similarity = dot product of vectors divided by product of their norms.
Step 2: Match formula to code
np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)) matches the formula exactly using np.dot and np.linalg.norm.
Final Answer:
np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)) -> Option A
Quick Check:
Cosine similarity = dot / (norm1 * norm2) [OK]
- Adding norms instead of multiplying
- Subtracting norms from dot product
- Multiplying dot product by sum of norms
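A quick sanity check separates the correct formula from the mistaken variants: a vector compared with itself must score exactly 1. The sketch below uses an arbitrary toy vector, and the function names are illustrative only.

```python
import numpy as np

def cosine(vec1, vec2):
    # Dot product divided by the product of the norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def cosine_added_norms(vec1, vec2):
    # Deliberately wrong variant: divides by the sum of the norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) + np.linalg.norm(vec2))

v = np.array([3.0, 4.0])  # toy vector with norm 5

print(cosine(v, v))              # 1.0
print(cosine_added_norms(v, v))  # 2.5
```

The wrong variant returns a value outside the range -1 to 1, which is an easy signal that the denominator is off.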
king = [0.5, 0.8, 0.3]
queen = [0.45, 0.75, 0.35]
man = [0.6, 0.7, 0.2]
woman = [0.55, 0.65, 0.25]
What is the closest word to the vector king - man + woman?
Solution
Step 1: Calculate the vector for king - man + woman
Subtract man from king: [0.5-0.6, 0.8-0.7, 0.3-0.2] = [-0.1, 0.1, 0.1]. Add woman: [-0.1+0.55, 0.1+0.65, 0.1+0.25] = [0.45, 0.75, 0.35].
Step 2: Compare result to known vectors
The resulting vector matches queen exactly: [0.45, 0.75, 0.35].
Final Answer:
queen -> Option C
Quick Check:
king - man + woman = queen [OK]
- Not subtracting man vector before adding woman
- Mixing up vector addition order
- Choosing original words instead of analogy result
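The arithmetic in the solution can be checked directly with NumPy, using the same toy vectors from the question. The sketch uses np.allclose instead of == because floating-point subtraction and addition can leave tiny rounding errors.

```python
import numpy as np

king = np.array([0.5, 0.8, 0.3])
queen = np.array([0.45, 0.75, 0.35])
man = np.array([0.6, 0.7, 0.2])
woman = np.array([0.55, 0.65, 0.25])

# Subtract man first, then add woman
result = king - man + woman

# Compare with a tolerance rather than exact equality
print(np.allclose(result, queen))  # True
```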
The following code tries to find the word closest to king - man + woman but has a flaw:
import numpy as np
words = {'king': np.array([0.5, 0.8, 0.3]), 'queen': np.array([0.45, 0.75, 0.35]), 'man': np.array([0.6, 0.7, 0.2]), 'woman': np.array([0.55, 0.65, 0.25])}
result = words['king'] - words['man'] + words['woman']
max_word = None
max_sim = -1
for word, vec in words.items():
    sim = np.dot(result, vec) / (np.linalg.norm(result) * np.linalg.norm(vec))
    if sim > max_sim:
        max_sim = sim
        max_word = word
print(max_word)
What is the main flaw?
Solution
Step 1: Analyze the similarity search loop
The loop compares the result vector to all words including 'king', 'man', and 'woman' which are part of the calculation.
Step 2: Understand why this is problematic
Including original words can cause the highest similarity to be the input words themselves, which is usually unwanted and can cause misleading results.
Final Answer:
The code does not exclude the original words from similarity search -> Option D
Quick Check:
Exclude input words to avoid bias [OK]
- Assuming zero division error without checking norms
- Thinking max_sim initialization causes error
- Ignoring normalization in dot product
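One way to address the flaw is to skip the input words during the search. The sketch below reuses the toy vectors from the question. The function name closest_word and its exclude parameter are illustrative choices, not an existing API.

```python
import numpy as np

words = {
    'king': np.array([0.5, 0.8, 0.3]),
    'queen': np.array([0.45, 0.75, 0.35]),
    'man': np.array([0.6, 0.7, 0.2]),
    'woman': np.array([0.55, 0.65, 0.25]),
}

def closest_word(target, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to target."""
    best_word, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in exclude:
            continue  # skip the words used to build the target
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

target = words['king'] - words['man'] + words['woman']
print(closest_word(target, words, exclude={'king', 'man', 'woman'}))  # queen
```

gensim's most_similar likewise leaves the input words out of the results it returns.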
Paris is to France as Tokyo is to ? Using pre-trained word vectors, which approach is best to find the answer?
Solution
Step 1: Understand analogy vector arithmetic
Analogies use the formula: word2 - word1 + word3 to find the missing word. Here, Paris is word1, France is word2, Tokyo is word3.
Step 2: Apply formula to this analogy
Calculate Tokyo - Paris + France to get the vector representing the answer.
Final Answer:
Calculate vector: Tokyo - Paris + France, then find closest word vector -> Option B
Quick Check:
Analogy vector = word3 - word1 + word2 [OK]
- Swapping order of subtraction and addition
- Adding all vectors without subtraction
- Using wrong words in formula
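The solution writes the same sum in two orders (word2 - word1 + word3 and Tokyo - Paris + France). Because vector addition is commutative, both give the same vector. What matters is which word is subtracted. The sketch below checks this with made-up two-dimensional vectors that are purely illustrative and not taken from any trained model.

```python
import numpy as np

# Hypothetical toy vectors; real embeddings have tens to hundreds of dimensions
paris = np.array([1.0, 2.0])
france = np.array([1.5, 3.0])
tokyo = np.array([4.0, 1.0])

a = france - paris + tokyo      # word2 - word1 + word3
b = tokyo - paris + france      # word3 - word1 + word2
wrong = paris - france + tokyo  # subtracts the wrong word

print(np.allclose(a, b))      # True
print(np.allclose(a, wrong))  # False
```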
