
How to Use Gensim for Word2Vec in NLP: Simple Guide

Use gensim.models.Word2Vec by providing a list of tokenized sentences to train a word embedding model. After training, use model.wv.most_similar() to find similar words or model.wv[word] to get word vectors.

Syntax

The main parameters for creating a Word2Vec model in Gensim are:

  • sentences: a list of tokenized sentences (a list of lists of words).
  • vector_size: dimensionality of the word vectors.
  • window: maximum distance between the current word and the predicted word within a sentence.
  • min_count: ignore words whose total frequency is lower than this.
  • workers: number of worker threads used for training.

After training, use model.wv to access word vectors and similarity methods.

python
from gensim.models import Word2Vec

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Access word vector for 'word'
vector = model.wv['word']

# Find most similar words
similar_words = model.wv.most_similar('word')
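
To make the window parameter more concrete, here is a minimal sketch of how (center, context) word pairs can be formed from one tokenized sentence. The helper context_pairs is hypothetical and not part of Gensim, and it leaves out details of Gensim's real training loop such as random window shrinking and subsampling.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for words at most `window` positions apart.

    Illustrative helper only; this is not how Gensim exposes its training data.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # first index inside the window
        hi = min(len(tokens), i + window + 1)   # one past the last index
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = context_pairs(['i', 'love', 'machine', 'learning'], window=1)
print(pairs)
```

With window=1 each word is paired only with its direct neighbors; a larger window pairs it with words further away in the same sentence.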

Example

This example shows how to train a Word2Vec model on simple sentences and find similar words.

python
from gensim.models import Word2Vec

# Sample tokenized sentences
sentences = [
    ['i', 'love', 'machine', 'learning'],
    ['machine', 'learning', 'is', 'fun'],
    ['natural', 'language', 'processing', 'is', 'interesting'],
    ['i', 'enjoy', 'learning', 'new', 'things']
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1)

# Get vector for word 'learning'
vector = model.wv['learning']
print('Vector for "learning":', vector)

# Find words most similar to 'learning'
similar = model.wv.most_similar('learning')
print('Words similar to "learning":', similar)
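Output (illustrative; exact values vary from run to run, and on a corpus this small the similarity scores are not meaningful)
Vector for "learning": [ 0.00123456 0.00234567 -0.00345678 ... 0.00456789 -0.00567890 0.00678901]
Words similar to "learning": [('machine', 0.98), ('fun', 0.85), ('love', 0.80), ('enjoy', 0.75), ('new', 0.70)]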
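
The scores returned by most_similar() are cosine similarities between word vectors. The sketch below reproduces that ranking in plain Python on a few made-up 2-dimensional vectors. The toy_vectors values and the helper names are assumptions for illustration only; they are not taken from a trained model or from Gensim's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, chosen only so the ranking is easy to follow.
toy_vectors = {
    'machine': [1.0, 0.0],
    'learning': [1.0, 1.0],
    'fun': [0.0, 1.0],
    'new': [-1.0, 0.0],
}


def rank_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = rank_similar('machine', toy_vectors)
print(ranked)
```

Here 'learning' ranks first because its vector points in the direction closest to that of 'machine'; 'new' points the opposite way and would rank last.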

Common Pitfalls

1. Not tokenizing sentences: Word2Vec requires a list of tokenized sentences, not raw text strings.

2. Setting min_count too low: min_count=1 keeps every word, including rare and noisy ones. For large corpora a higher value is usually better (Gensim's default is 5).

3. Forgetting to train the model: You must either pass sentences at initialization, which builds the vocabulary and trains in one step, or call model.build_vocab() followed by model.train() before using vectors.

python
from gensim.models import Word2Vec

# Wrong: sentences not tokenized
sentences = ['I love machine learning', 'Machine learning is fun']

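# No error is raised, but each string is treated as a sequence of characters,
# so the model learns vectors for single characters instead of words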
# model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1)

# Correct: tokenize sentences
sentences_tokenized = [s.lower().split() for s in sentences]
model = Word2Vec(sentences_tokenized, vector_size=50, window=3, min_count=1, workers=1)

Quick Reference

Tips for using Gensim Word2Vec:

  • Always provide tokenized sentences as input.
  • Adjust vector_size and window based on your data size and task.
  • Use min_count to filter rare words.
  • Access word vectors with model.wv[word].
  • Find similar words with model.wv.most_similar(word).
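
As a rough illustration of what min_count does, the sketch below counts word frequencies and drops words that occur fewer than min_count times. The filter_rare helper is hypothetical; Gensim applies this filtering internally when it builds the vocabulary, so you do not need to do it yourself.

```python
from collections import Counter


def filter_rare(sentences, min_count):
    """Drop words whose total corpus frequency is below `min_count`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word for word in sentence if counts[word] >= min_count]
        for sentence in sentences
    ]


sentences = [
    ['i', 'love', 'machine', 'learning'],
    ['machine', 'learning', 'is', 'fun'],
    ['natural', 'language', 'processing', 'is', 'interesting'],
    ['i', 'enjoy', 'learning', 'new', 'things'],
]

filtered = filter_rare(sentences, min_count=2)
kept_vocab = sorted({word for sentence in filtered for word in sentence})
print(kept_vocab)
```

On this tiny corpus, min_count=2 leaves only the four words that appear at least twice, which shows why a low value is needed for toy examples while larger corpora can afford a higher one.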

Key Takeaways

Provide tokenized sentences to train Gensim's Word2Vec model.
Use model.wv to access word vectors and similarity functions.
Set parameters like vector_size, window, and min_count to fit your data.
Avoid feeding raw text strings; always tokenize first.
Make sure the model has been trained before querying vectors.