Word2Vec helps computers work with words by turning each word into a list of numbers (a vector). It learns these vectors from which words appear near each other in sentences, so words used in similar contexts end up with similar vectors.
Word2Vec (CBOW and Skip-gram) in NLP
from gensim.models import Word2Vec
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)  # sg=0 means CBOW, sg=1 means Skip-gram
sentences is a list of tokenized sentences (list of word lists).
vector_size sets the size of the word vectors (like how many numbers represent each word).
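To make the expected input shape concrete, here is a minimal sketch of turning raw text into a list of word lists. The helper name tokenize_corpus and the simple regular-expression tokenizer are assumptions made for illustration; real projects often use a dedicated tokenizer.

```python
import re

def tokenize_corpus(raw_sentences):
    """Turn raw strings into the list-of-word-lists shape Word2Vec expects."""
    tokenized = []
    for text in raw_sentences:
        # Lowercase, then keep runs of letters, digits, and apostrophes.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        if tokens:  # skip sentences that produce no tokens
            tokenized.append(tokens)
    return tokenized

raw = ["I love machine learning.", "Machine learning is fun!"]
sentences = tokenize_corpus(raw)
print(sentences)
# [['i', 'love', 'machine', 'learning'], ['machine', 'learning', 'is', 'fun']]
```

Lowercasing is a design choice: it makes 'Machine' and 'machine' share one vector instead of being learned as two separate words.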
model = Word2Vec(sentences, vector_size=50, window=3, sg=0)
model = Word2Vec(sentences, vector_size=100, window=5, sg=1)
This code trains two Word2Vec models: one using CBOW and one using Skip-gram. It then shows the vector for the word 'machine' and finds words similar to 'machine' in both models.
from gensim.models import Word2Vec

# Sample sentences
sentences = [
    ['I', 'love', 'machine', 'learning'],
    ['Word2Vec', 'helps', 'understand', 'words'],
    ['Skip', 'gram', 'and', 'CBOW', 'are', 'models'],
    ['Machine', 'learning', 'is', 'fun']
]

# Train CBOW model (sg=0)
model_cbow = Word2Vec(sentences, vector_size=20, window=2, min_count=1, sg=0)

# Train Skip-gram model (sg=1)
model_sg = Word2Vec(sentences, vector_size=20, window=2, min_count=1, sg=1)

# Get vector for word 'machine'
vec_cbow = model_cbow.wv['machine']
vec_sg = model_sg.wv['machine']

# Find most similar words to 'machine' in CBOW
similar_cbow = model_cbow.wv.most_similar('machine')

# Find most similar words to 'machine' in Skip-gram
similar_sg = model_sg.wv.most_similar('machine')

print('CBOW vector for machine:', vec_cbow)
print('Skip-gram vector for machine:', vec_sg)
print('CBOW most similar to machine:', similar_cbow)
print('Skip-gram most similar to machine:', similar_sg)
CBOW predicts a word from its surrounding words. It trains quickly and tends to work well for frequent words.
Skip-gram predicts surrounding words from a given word. It trains more slowly but tends to represent rare words better.
Word vectors are lists of numbers that capture word meaning based on context.
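The difference between the two models is easiest to see in the training examples each one builds from a sentence. The sketch below is a simplified illustration, not how Gensim works internally: the function names are made up for this example, and real implementations add refinements such as negative sampling and subsampling of frequent words.

```python
def cbow_examples(tokens, window):
    """CBOW: a list of context words is used to predict one target word."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            examples.append((context, target))
    return examples

def skipgram_examples(tokens, window):
    """Skip-gram: one target word is used to predict each context word."""
    examples = []
    for context, target in cbow_examples(tokens, window):
        for word in context:
            examples.append((target, word))
    return examples

tokens = ['I', 'love', 'machine', 'learning']

print(cbow_examples(tokens, window=1))
# [(['love'], 'I'), (['I', 'machine'], 'love'),
#  (['love', 'learning'], 'machine'), (['machine'], 'learning')]

print(skipgram_examples(tokens, window=1))
# [('I', 'love'), ('love', 'I'), ('love', 'machine'),
#  ('machine', 'love'), ('machine', 'learning'), ('learning', 'machine')]
```

Skip-gram produces one example per (target, context word) pair, so every occurrence of a word yields several training examples. That is one common explanation for why it tends to do better on rare words.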
Word2Vec turns words into numbers that show their meaning.
CBOW and Skip-gram are two ways Word2Vec learns word meanings.
Use Word2Vec to find similar words or prepare text for machine learning.
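Finding similar words comes down to comparing vectors. Gensim's most_similar ranks words by cosine similarity; the sketch below reimplements that idea in plain Python. The three-dimensional vectors are invented for illustration and do not come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors, made up for this example.
toy_vectors = {
    'king': [0.9, 0.8, 0.1],
    'queen': [0.85, 0.9, 0.15],
    'apple': [0.1, 0.2, 0.95],
}

print(most_similar('king', toy_vectors, topn=1))  # 'queen' ranks first here
```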
Practice
Solution
Step 1: Understand CBOW model purpose
CBOW tries to predict the target word using the surrounding context words.
Step 2: Understand Skip-gram model purpose
Skip-gram tries to predict the surrounding context words given the target word.
Final Answer:
CBOW predicts a word based on its context, while Skip-gram predicts context words from a target word. -> Option B
Quick Check:
CBOW = context to word, Skip-gram = word to context [OK]
- Confusing which model predicts context vs. target word
- Thinking both models do the same prediction
- Assuming CBOW needs labeled data
Solution
Step 1: Identify correct parameter for Skip-gram
In Gensim, 'sg=1' sets Skip-gram, 'sg=0' sets CBOW.
Step 2: Use correct parameter names
Since Gensim 4.0+, 'vector_size' replaces 'size' for embedding dimension.
Final Answer:
Word2Vec(sentences, vector_size=100, window=5, sg=1) -> Option D
Quick Check:
sg=1 and vector_size used correctly [OK]
- Using 'size' instead of 'vector_size' in recent Gensim versions
- Setting sg=0 which is CBOW, not Skip-gram
- Confusing sg parameter values
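If code has to run against both older and newer Gensim releases, the renamed parameters can be chosen from the installed version. The helper below is a hypothetical sketch (word2vec_kwargs is not part of Gensim). It relies on the Gensim 4.0 renames of 'size' to 'vector_size' and 'iter' to 'epochs'.

```python
def word2vec_kwargs(gensim_version, dim, skip_gram, epochs=5):
    """Build Word2Vec keyword arguments that match the Gensim major version."""
    major = int(gensim_version.split('.')[0])
    kwargs = {'sg': 1 if skip_gram else 0}  # 1 = Skip-gram, 0 = CBOW
    if major >= 4:
        kwargs['vector_size'] = dim
        kwargs['epochs'] = epochs
    else:
        # Parameter names used before Gensim 4.0
        kwargs['size'] = dim
        kwargs['iter'] = epochs
    return kwargs

print(word2vec_kwargs('4.3.2', dim=100, skip_gram=True))
# {'sg': 1, 'vector_size': 100, 'epochs': 5}

print(word2vec_kwargs('3.8.3', dim=100, skip_gram=True))
# {'sg': 1, 'size': 100, 'iter': 5}

# Possible usage, assuming gensim is installed:
#   import gensim
#   from gensim.models import Word2Vec
#   params = word2vec_kwargs(gensim.__version__, dim=100, skip_gram=True)
#   model = Word2Vec(sentences, window=5, **params)
```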
model.wv.most_similar('king', topn=1) if the model is trained on a typical English corpus?
Solution
Step 1: Understand Word2Vec similarity
Word2Vec places words that appear in similar contexts close together; 'queen' is semantically close to 'king'.
Step 2: Analyze typical English corpus relations
Words like 'apple', 'car', or 'run' are unrelated to 'king' in meaning or context.
Final Answer:
[('queen', similarity_score)] -> Option C
Quick Check:
Of the listed options, 'queen' is the closest to 'king' [OK]
- Choosing unrelated words as most similar
- Confusing syntactic similarity with semantic similarity
- Expecting exact similarity scores
KeyError: 'unknown_word' when querying model.wv['unknown_word']. What is the most likely cause and fix?
Solution
Step 1: Understand KeyError cause
KeyError occurs when the queried word is not in the model's vocabulary.
Step 2: Fix by ensuring word presence
Either add the word to training data or check if word exists before querying to avoid error.
Final Answer:
The word was not in training data; retrain with larger corpus or check vocabulary before querying. -> Option A
Quick Check:
KeyError means word missing in vocabulary [OK]
- Assuming model type (CBOW/Skip-gram) causes KeyError
- Changing vector or window size to fix missing word error
- Ignoring vocabulary check before querying
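The vocabulary check can be wrapped in a small helper. This is a sketch under assumptions: safe_vector is a hypothetical name, and a plain dictionary stands in for model.wv so the example runs without a trained model. In Gensim 4, the expression word in model.wv performs the same membership test.

```python
def safe_vector(keyed_vectors, word, default=None):
    """Return the vector for `word`, or `default` if it is out of vocabulary."""
    if word in keyed_vectors:
        return keyed_vectors[word]
    return default

# A dictionary standing in for model.wv (values made up for illustration).
toy_wv = {'machine': [0.1, 0.2], 'learning': [0.3, 0.4]}

print(safe_vector(toy_wv, 'machine'))       # [0.1, 0.2]
print(safe_vector(toy_wv, 'unknown_word'))  # None

# With a trained Gensim model, the call would look like:
#   vec = safe_vector(model.wv, 'unknown_word')
```

Returning a default instead of raising lets the caller decide how to handle unknown words, for example by skipping them or substituting a zero vector.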
Solution
Step 1: Identify model for rare words
Skip-gram is better at learning rare word representations than CBOW.
Step 2: Adjust window size and epochs
Smaller window focuses on close context, improving rare word meaning; more epochs improve training quality.
Final Answer:
Use Skip-gram with a smaller window size and increase training epochs. -> Option A
Quick Check:
Skip-gram + small window + more epochs = better rare word capture [OK]
- Choosing CBOW for rare word learning
- Using large window size which dilutes context
- Reducing epochs which limits training
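One related setting worth checking when rare words matter is min_count: Gensim ignores every word whose total frequency is below it (the default is 5), so a rare word may never get a vector at all. The sketch below imitates that filter in plain Python; surviving_vocab is a hypothetical helper, not a Gensim function.

```python
from collections import Counter

def surviving_vocab(sentences, min_count):
    """Return the words that appear at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

sentences = [
    ['machine', 'learning', 'is', 'fun'],
    ['machine', 'learning', 'needs', 'data'],
    ['embeddings', 'need', 'data'],
]

print(sorted(surviving_vocab(sentences, min_count=2)))
# ['data', 'learning', 'machine']

print('embeddings' in surviving_vocab(sentences, min_count=1))
# True
```

With min_count=2 the word 'embeddings' is dropped, and querying it on a model trained that way would raise a KeyError.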
