How to Use Gensim for Word2Vec in NLP: Simple Guide
Use gensim.models.Word2Vec by providing a list of tokenized sentences to train a word embedding model. After training, use model.wv.most_similar() to find similar words or model.wv[word] to get word vectors.
Syntax
The Word2Vec constructor in Gensim takes these main parameters:
- sentences: a list of tokenized sentences (a list of lists of words).
- vector_size: dimensionality of the word vectors.
- window: maximum distance between the current and predicted word within a sentence.
- min_count: ignore words whose total frequency is lower than this.
- workers: number of worker threads used for training.
After training, use model.wv to access word vectors and similarity methods. The basic usage looks like this:
python
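from gensim.models import Word2Vec

# sentences must be a list of token lists; this tiny placeholder corpus is illustrative only
sentences = [['word', 'vectors', 'capture', 'meaning'], ['each', 'word', 'gets', 'a', 'vector']]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Access word vector for 'word'
vector = model.wv['word']

# Find most similar words
similar_words = model.wv.most_similar('word')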
Example
This example shows how to train a Word2Vec model on simple sentences and find similar words.
python
from gensim.models import Word2Vec

# Sample tokenized sentences
sentences = [
    ['i', 'love', 'machine', 'learning'],
    ['machine', 'learning', 'is', 'fun'],
    ['natural', 'language', 'processing', 'is', 'interesting'],
    ['i', 'enjoy', 'learning', 'new', 'things']
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1)

# Get vector for word 'learning'
vector = model.wv['learning']
print('Vector for "learning":', vector)

# Find words most similar to 'learning'
similar = model.wv.most_similar('learning')
print('Words similar to "learning":', similar)
Output
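Vector for "learning": [ 0.00123456 0.00234567 -0.00345678 ... 0.00456789 -0.00567890 0.00678901]
Words similar to "learning": [('machine', 0.98), ('fun', 0.85), ('love', 0.80), ('enjoy', 0.75), ('new', 0.70)]
These values are illustrative only. The actual numbers vary between runs and Gensim versions, most_similar() returns up to 10 results by default (topn=10), and on a corpus this small the similarity scores are usually much lower and carry little meaning.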
Common Pitfalls
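1. Not tokenizing sentences: Word2Vec requires a list of tokenized sentences, not raw text strings. If you pass raw strings, Gensim iterates over each string character by character and learns vectors for single characters instead of words.
2. Using too small min_count: Setting min_count=1 keeps all words but may include noise; the default is 5, and larger corpora usually benefit from a higher threshold.
3. Forgetting to train the model: You must either pass sentences at initialization or, if you create the model without them, call model.build_vocab() and then model.train() before using vectors.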
python
from gensim.models import Word2Vec

# Wrong: sentences not tokenized
sentences = ['I love machine learning', 'Machine learning is fun']
# Gensim would treat each string as a sequence of characters,
# so the model would learn vectors for characters, not words
# model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1)

# Correct: tokenize sentences
sentences_tokenized = [s.lower().split() for s in sentences]
model = Word2Vec(sentences_tokenized, vector_size=50, window=3, min_count=1, workers=1)
Quick Reference
Tips for using Gensim Word2Vec:
- Always provide tokenized sentences as input.
- Adjust vector_size and window based on your data size and task.
- Use min_count to filter rare words.
- Access word vectors with model.wv[word].
- Find similar words with model.wv.most_similar(word).
Key Takeaways
Provide tokenized sentences to train Gensim's Word2Vec model.
Use model.wv to access word vectors and similarity functions.
Set parameters like vector_size, window, and min_count to fit your data.
Avoid feeding raw text strings; always tokenize first.
Check model training is complete before querying vectors.
