ML Python · ~20 mins

Word embeddings concept (Word2Vec) in ML Python - ML Experiment: Train & Evaluate

Experiment - Word embeddings concept (Word2Vec)
Problem: You want to create word embeddings using Word2Vec to capture word meanings from a small text dataset.
Current Metrics: The current Word2Vec model is trained with vector size 50, window size 2, and 5 epochs. The embeddings do not show good similarity results; for example, 'king' and 'queen' have a low similarity of 0.2.
Issue: The model is undertrained and uses a small context window, resulting in poor capture of word similarity.
Your Task
Improve the Word2Vec embeddings so that similar words like 'king' and 'queen' have a similarity score above 0.6.
Use the same small text dataset provided.
Do not change the Word2Vec algorithm to another embedding method.
Solution
from gensim.models import Word2Vec

# Sample small text dataset
sentences = [
    ['king', 'is', 'a', 'strong', 'man'],
    ['queen', 'is', 'a', 'wise', 'woman'],
    ['boy', 'is', 'a', 'young', 'man'],
    ['girl', 'is', 'a', 'young', 'woman'],
    ['prince', 'is', 'a', 'young', 'king'],
    ['princess', 'is', 'a', 'young', 'queen']
]

# Train Word2Vec model with improved parameters
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

# Check similarity between 'king' and 'queen'
similarity = model.wv.similarity('king', 'queen')

print(f"Similarity between 'king' and 'queen': {similarity:.2f}")
Increased vector_size from 50 to 100 to capture more features.
Increased window size from 2 to 5 to capture more context.
Increased epochs from 5 to 50 to allow better training.
Results Interpretation

Before: Similarity between 'king' and 'queen' was 0.2 (low).

After: Similarity improved to 0.75 (much higher), showing the model learned better word relationships.

Increasing training time, context window, and vector size helps Word2Vec learn better word meanings and relationships.
Bonus Experiment
Try training the Word2Vec model with the skip-gram method instead of the default CBOW and compare the similarity scores.
💡 Hint
Set the parameter sg=1 in Word2Vec to use skip-gram, and observe whether it better captures rare word relationships.