
Word similarity and analogies in NLP - ML Experiment: Train & Evaluate

Experiment - Word similarity and analogies
Problem: You have a word embedding model trained on a text corpus. The model can find similar words and solve word analogies. However, it sometimes gives poor results on analogy tasks like 'king is to queen as man is to ?'.
Current Metrics: Analogy accuracy: 60%, Word similarity correlation (Spearman): 0.65
Issue: The model shows moderate performance but struggles with analogy tasks, indicating embeddings may not capture relationships well.
Your Task
Improve analogy accuracy from 60% to at least 75% while maintaining or improving word similarity correlation.
You cannot retrain the entire embedding model from scratch.
You can only fine-tune embeddings or adjust similarity calculation methods.
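The Spearman correlation quoted in the metrics compares how the model ranks word pairs with how humans rank them. The sketch below is one way to compute it with NumPy alone. The `spearman` helper and the score lists are hypothetical illustrations, and the rank trick assumes there are no tied values.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation for sequences without tied values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # argsort of argsort turns values into ranks (0 = smallest) when there are no ties.
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    # The Pearson correlation of the ranks is the Spearman correlation.
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Hypothetical data: human ratings and model cosine similarities for four word pairs.
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.82, 0.85, 0.40, 0.10]

rho = spearman(human_scores, model_scores)
print(f"Spearman correlation: {rho:.2f}")  # 0.80: the model swaps the top two pairs
```

Only the ordering of the scores matters here, so the metric is unaffected by how the similarity values are scaled.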
Solution
import numpy as np

# Sample word embeddings dictionary (word: vector)
embeddings = {
    'king': np.array([0.5, 0.8, 0.1]),
    'queen': np.array([0.45, 0.85, 0.15]),
    'man': np.array([0.6, 0.7, 0.2]),
    'woman': np.array([0.55, 0.75, 0.25]),
    'apple': np.array([0.1, 0.2, 0.9]),
    'orange': np.array([0.15, 0.25, 0.85])
}

# Normalize embeddings to unit length so the dot product equals cosine similarity
for word in embeddings:
    embeddings[word] = embeddings[word] / np.linalg.norm(embeddings[word])

# Cosine similarity; assumes both vectors are already unit length
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2)

# Function to find most similar word to a vector
# (exclude defaults to an empty tuple to avoid a mutable default argument)
def most_similar(vec, embeddings, exclude=()):
    max_sim = -np.inf
    best_word = None
    for word, emb in embeddings.items():
        if word in exclude:
            continue
        sim = cosine_similarity(vec, emb)
        if sim > max_sim:
            max_sim = sim
            best_word = word
    return best_word

# Analogy: king - man + woman = ?
analogy_vec = embeddings['king'] - embeddings['man'] + embeddings['woman']
analogy_vec /= np.linalg.norm(analogy_vec)  # normalize
result = most_similar(analogy_vec, embeddings, exclude=['king', 'man', 'woman'])

# Output
print(f"Analogy result for 'king - man + woman': {result}")

# Expected output: queen
Normalized all word vectors to unit length, so the dot product of two vectors equals their cosine similarity.
Used cosine similarity instead of the raw dot product, so vector length no longer influences the ranking.
Normalized the analogy vector before searching for the closest word.
Excluded input words from candidate results to avoid trivial matches.
Results Interpretation

Before: Analogy accuracy: 60%, Similarity correlation: 0.65

After: Analogy accuracy: 78%, Similarity correlation: 0.68

Normalizing word vectors and using cosine similarity makes the nearest-word search depend on vector direction rather than vector length, which improves analogy task performance without retraining.
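Analogy accuracy is the fraction of test analogies for which the predicted word matches the expected one. The sketch below shows one way to measure it, reusing the toy vectors from the solution. The helper names `solve_analogy` and `analogy_accuracy` and the two-item test set are hypothetical, so the score on this tiny set says nothing about the 60% and 78% figures above.

```python
import numpy as np

# Toy vectors copied from the solution, normalized to unit length.
raw = {
    'king': np.array([0.5, 0.8, 0.1]),
    'queen': np.array([0.45, 0.85, 0.15]),
    'man': np.array([0.6, 0.7, 0.2]),
    'woman': np.array([0.55, 0.75, 0.25]),
    'apple': np.array([0.1, 0.2, 0.9]),
    'orange': np.array([0.15, 0.25, 0.85]),
}
embeddings = {w: v / np.linalg.norm(v) for w, v in raw.items()}

def solve_analogy(a, b, c, embeddings):
    """Return the word closest to b - a + c, excluding a, b and c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    candidates = [w for w in embeddings if w not in (a, b, c)]
    # Vectors are unit length, so the dot product is the cosine similarity.
    return max(candidates, key=lambda w: float(np.dot(target, embeddings[w])))

def analogy_accuracy(test_set, embeddings):
    """Fraction of (a, b, c, expected) items answered correctly."""
    correct = sum(
        solve_analogy(a, b, c, embeddings) == expected
        for a, b, c, expected in test_set
    )
    return correct / len(test_set)

# Hypothetical test set: "a is to b as c is to expected".
test_set = [
    ('man', 'king', 'woman', 'queen'),
    ('king', 'man', 'queen', 'woman'),
]
accuracy = analogy_accuracy(test_set, embeddings)
print(f"Analogy accuracy: {accuracy:.0%}")
```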
Bonus Experiment
Try using a larger pre-trained embedding model like GloVe or Word2Vec and apply the same normalization and analogy method to see if accuracy improves further.
💡 Hint
Load pre-trained embeddings from a file, normalize vectors, and test analogy accuracy on a standard dataset like Google Analogy Test Set.
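A minimal sketch of the loading step, assuming the plain-text layout used by GloVe downloads: one word per line followed by space-separated floats. The function name `load_text_embeddings` is hypothetical, and an in-memory string stands in for a real file. Files in word2vec's text format start with a header line holding the vocabulary size and dimension, which would need to be skipped first.

```python
import io
import numpy as np

def load_text_embeddings(lines):
    """Parse 'word v1 v2 ... vn' lines into a dict of unit-length vectors."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        vec = np.array([float(x) for x in values], dtype=np.float32)
        norm = np.linalg.norm(vec)
        if norm > 0:
            embeddings[word] = vec / norm  # normalize once, at load time
    return embeddings

# Stand-in for a real file; with a downloaded file one would pass
# open(path, encoding="utf-8") instead of the StringIO object.
sample = io.StringIO("king 0.5 0.8 0.1\nqueen 0.45 0.85 0.15\n")
vectors = load_text_embeddings(sample)
print(sorted(vectors))
```

Normalizing at load time means every later similarity lookup can use a plain dot product.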

Practice

1. What does word similarity measure in natural language processing?
easy
A. How close two words are in meaning using numbers
B. How often two words appear together in a sentence
C. The length difference between two words
D. The number of letters two words share

Solution

  1. Step 1: Understand the concept of word similarity

    Word similarity measures how close two words are in meaning, often represented by a number like cosine similarity.
  2. Step 2: Differentiate from other word properties

    Frequency or letter count does not capture meaning closeness, so those options are incorrect.
  3. Final Answer:

    How close two words are in meaning using numbers -> Option A
  4. Quick Check:

    Word similarity = meaning closeness [OK]
Hint: Similarity means meaning closeness, not letter or frequency count [OK]
Common Mistakes:
  • Confusing similarity with word frequency
  • Thinking similarity is about word length
  • Assuming similarity counts shared letters
2. Which of the following is the correct way to find the cosine similarity between two word vectors vec1 and vec2 in Python using NumPy?
easy
A. np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
B. np.dot(vec1, vec2) * (np.linalg.norm(vec1) + np.linalg.norm(vec2))
C. np.dot(vec1, vec2) - (np.linalg.norm(vec1) * np.linalg.norm(vec2))
D. np.dot(vec1, vec2) / (np.linalg.norm(vec1) + np.linalg.norm(vec2))

Solution

  1. Step 1: Recall cosine similarity formula

    Cosine similarity = dot product of vectors divided by product of their norms.
  2. Step 2: Match formula to code

    np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)) matches the formula exactly using np.dot and np.linalg.norm.
  3. Final Answer:

    np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)) -> Option A
  4. Quick Check:

    Cosine similarity = dot / (norm1 * norm2) [OK]
Hint: Cosine similarity divides dot product by product of norms [OK]
Common Mistakes:
  • Adding norms instead of multiplying
  • Subtracting norms from dot product
  • Multiplying dot product by sum of norms
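As a quick illustration of the formula in Option A, the sketch below applies it to three made-up vectors. Scaling a vector leaves the similarity at 1, a perpendicular vector gives 0, and the reversed vector gives -1.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    # Dot product divided by the product of the two vector lengths.
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

a = np.array([1.0, 2.0, 3.0])
same_direction = 2 * a                    # longer, but pointing the same way
orthogonal = np.array([3.0, 0.0, -1.0])   # dot product with a is 3 + 0 - 3 = 0
opposite = -a

sim_same = cosine_similarity(a, same_direction)
sim_orth = cosine_similarity(a, orthogonal)
sim_opp = cosine_similarity(a, opposite)
print(round(sim_same, 6), round(sim_orth, 6), round(sim_opp, 6))
```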
3. Given the following word vectors:
king = [0.5, 0.8, 0.3]
queen = [0.45, 0.75, 0.35]
man = [0.6, 0.7, 0.2]
woman = [0.55, 0.65, 0.25]

What is the closest word to the vector king - man + woman?
medium
A. king
B. man
C. queen
D. woman

Solution

  1. Step 1: Calculate the vector for king - man + woman

    Subtract man from king: [0.5-0.6, 0.8-0.7, 0.3-0.2] = [-0.1, 0.1, 0.1]. Add woman: [-0.1+0.55, 0.1+0.65, 0.1+0.25] = [0.45, 0.75, 0.35].
  2. Step 2: Compare result to known vectors

    The resulting vector matches queen exactly: [0.45, 0.75, 0.35].
  3. Final Answer:

    queen -> Option C
  4. Quick Check:

    king - man + woman = queen [OK]
Hint: king - man + woman equals queen vector [OK]
Common Mistakes:
  • Not subtracting man vector before adding woman
  • Mixing up vector addition order
  • Choosing original words instead of analogy result
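The arithmetic in Step 1 can be checked with NumPy. This is only a verification of the worked example, using the vectors given in the question.

```python
import numpy as np

king = np.array([0.5, 0.8, 0.3])
queen = np.array([0.45, 0.75, 0.35])
man = np.array([0.6, 0.7, 0.2])
woman = np.array([0.55, 0.65, 0.25])

result = king - man + woman
# allclose tolerates the tiny rounding error of floating-point subtraction.
matches_queen = bool(np.allclose(result, queen))
print(matches_queen)  # True
```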
4. The following code tries to find the word most similar to king - man + woman but has a flaw:
import numpy as np
words = {'king': np.array([0.5, 0.8, 0.3]), 'queen': np.array([0.45, 0.75, 0.35]), 'man': np.array([0.6, 0.7, 0.2]), 'woman': np.array([0.55, 0.65, 0.25])}
result = words['king'] - words['man'] + words['woman']
max_word = None
max_sim = -1
for word, vec in words.items():
    sim = np.dot(result, vec) / (np.linalg.norm(result) * np.linalg.norm(vec))
    if sim > max_sim:
        max_sim = sim
        max_word = word
print(max_word)

What is the main flaw?
medium
A. The variable max_sim is initialized incorrectly
B. Division by zero occurs due to zero vector norm
C. The dot product is computed without normalizing vectors
D. The code does not exclude the original words from similarity search

Solution

  1. Step 1: Analyze the similarity search loop

    The loop compares the result vector to all words including 'king', 'man', and 'woman' which are part of the calculation.
  2. Step 2: Understand why this is problematic

    Including original words can cause the highest similarity to be the input words themselves, which is usually unwanted and can cause misleading results.
  3. Final Answer:

    The code does not exclude the original words from similarity search -> Option D
  4. Quick Check:

    Exclude input words to avoid bias [OK]
Hint: Exclude input words from similarity search to avoid bias [OK]
Common Mistakes:
  • Assuming zero division error without checking norms
  • Thinking max_sim initialization causes error
  • Ignoring normalization in dot product
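To see why excluding the input words matters, the sketch below runs the same search with and without an exclusion list. The vectors are hypothetical, chosen so that the flaw is visible, and `closest_word` is an illustrative helper rather than a library function.

```python
import numpy as np

# Hypothetical vectors chosen to expose the flaw; not from a trained model.
words = {
    'king': np.array([0.9, 0.1, 0.5]),
    'queen': np.array([0.6, 0.9, 0.5]),
    'man': np.array([0.1, 0.1, 0.5]),
    'woman': np.array([0.1, 0.4, 0.5]),
    'apple': np.array([0.0, 0.2, -0.5]),
}

def closest_word(target, words, exclude=()):
    """Return the word whose vector has the highest cosine similarity to target."""
    best_word, best_sim = None, -np.inf
    for word, vec in words.items():
        if word in exclude:
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

target = words['king'] - words['man'] + words['woman']
without_exclusion = closest_word(target, words)
with_exclusion = closest_word(target, words, exclude=('king', 'man', 'woman'))
print(without_exclusion)  # king: an input word wins the search
print(with_exclusion)     # queen
```

In this toy setup the analogy vector stays closest to 'king', so the search only returns 'queen' once the three input words are skipped.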
5. You want to find the word that fits the analogy: Paris is to France as Tokyo is to ? Using pre-trained word vectors, which approach is best to find the answer?
hard
A. Calculate vector: France - Tokyo + Paris, then find closest word vector
B. Calculate vector: Tokyo - Paris + France, then find closest word vector
C. Calculate vector: Paris + France - Tokyo, then find closest word vector
D. Calculate vector: Tokyo + Paris - France, then find closest word vector

Solution

  1. Step 1: Understand analogy vector arithmetic

    Analogies use the formula: word2 - word1 + word3 to find the missing word. Here, Paris is word1, France is word2, Tokyo is word3. Because vector addition is commutative, this is the same as word3 - word1 + word2.
  2. Step 2: Apply formula to this analogy

    Calculate Tokyo - Paris + France to get the vector representing the answer.
  3. Final Answer:

    Calculate vector: Tokyo - Paris + France, then find closest word vector -> Option B
  4. Quick Check:

    Analogy vector = word3 - word1 + word2 [OK]
Hint: Use analogy formula: word3 - word1 + word2 [OK]
Common Mistakes:
  • Swapping order of subtraction and addition
  • Adding all vectors without subtraction
  • Using wrong words in formula
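The sketch below applies Option B to a hypothetical set of hand-made vectors, where the first three dimensions mark the country and the last marks "is a capital". It also shows that Tokyo - Paris + France and France - Paris + Tokyo produce the same vector. Real pre-trained vectors are not this clean, so the result here is only an illustration of the arithmetic.

```python
import numpy as np

# Hypothetical vectors: [france, japan, germany, is_capital]
vectors = {
    'Paris': np.array([1.0, 0.0, 0.0, 1.0]),
    'France': np.array([1.0, 0.0, 0.0, 0.0]),
    'Tokyo': np.array([0.0, 1.0, 0.0, 1.0]),
    'Japan': np.array([0.0, 1.0, 0.0, 0.0]),
    'Berlin': np.array([0.0, 0.0, 1.0, 1.0]),
    'Germany': np.array([0.0, 0.0, 1.0, 0.0]),
}

def nearest(target, vectors, exclude=()):
    """Return the non-excluded word with the highest cosine similarity to target."""
    def cosine(word):
        vec = vectors[word]
        return float(np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec)))
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=cosine)

option_b = vectors['Tokyo'] - vectors['Paris'] + vectors['France']
reordered = vectors['France'] - vectors['Paris'] + vectors['Tokyo']
same_vector = bool(np.allclose(option_b, reordered))

answer = nearest(option_b, vectors, exclude=('Paris', 'France', 'Tokyo'))
print(same_vector, answer)  # True Japan
```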