NLP · ~5 mins

Training Word2Vec with Gensim in NLP

Introduction

Word2Vec helps computers understand words by turning them into numeric vectors that capture their meaning. Training Word2Vec with Gensim lets you learn these word vectors from your own text.

You want to find similar words in a collection of documents.
You need to convert words into numbers for a machine learning model.
You want to explore relationships between words, like 'king' and 'queen'.
You have a custom text dataset and want to learn word meanings from it.
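
The 'king'/'queen' relationship comes from simple vector arithmetic on the learned embeddings. A toy sketch with made-up 3-dimensional vectors (hypothetical values, just to illustrate the idea; real Word2Vec vectors have 100+ dimensions):

```python
import math

# Hypothetical 3-d word vectors (made-up values for illustration)
vecs = {
    'king':  [0.9, 0.8, 0.1],
    'man':   [0.9, 0.1, 0.1],
    'woman': [0.1, 0.1, 0.9],
    'queen': [0.1, 0.8, 0.9],
    'apple': [0.5, 0.2, 0.3],
}

# king - man + woman should land near queen
target = [k - m + w for k, m, w in zip(vecs['king'], vecs['man'], vecs['woman'])]
nearest = min((w for w in vecs if w not in ('king', 'man', 'woman')),
              key=lambda w: math.dist(vecs[w], target))
print(nearest)  # queen
```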
Syntax
Python
from gensim.models import Word2Vec

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4, epochs=10)

sentences is a list of tokenized sentences (a list of lists of words).

vector_size sets the dimensionality of the word vectors (default 100).

window is the maximum distance between the current and a predicted word within a sentence (default 5).

min_count drops words with a total frequency below this threshold (default 5).

workers sets the number of worker threads used for training.

epochs is the number of passes over the corpus (default 5).
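
Word2Vec does not tokenize text for you. A minimal sketch of turning raw strings into the expected list-of-lists format (a plain-Python tokenizer, used here as an assumption; Gensim's simple_preprocess offers similar behavior):

```python
import re

raw_docs = ["Hello world!", "Machine learning is fun."]

# Lowercase each document and keep alphanumeric runs as tokens
sentences = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in raw_docs]
print(sentences)  # [['hello', 'world'], ['machine', 'learning', 'is', 'fun']]
```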

Examples
Train Word2Vec on two simple sentences with smaller vector size and fewer epochs.
Python
sentences = [['hello', 'world'], ['machine', 'learning', 'is', 'fun']]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2, epochs=5)
Ignore words that appear fewer than two times by setting min_count=2.
Python
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4, epochs=10)
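
To see which words min_count=2 would keep, you can count frequencies yourself. A plain-Python sketch (the third sentence is made up here so that some words repeat):

```python
from collections import Counter

sentences = [['hello', 'world'], ['machine', 'learning', 'is', 'fun'],
             ['hello', 'machine']]  # third sentence added for illustration

# Count every token, then keep only words appearing at least twice
counts = Counter(word for sent in sentences for word in sent)
kept = sorted(w for w, c in counts.items() if c >= 2)
print(kept)  # ['hello', 'machine']
```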
Sample Model

This code trains Word2Vec on a few example sentences, then shows the vector for the word 'learning' and the top 3 similar words.

Python
from gensim.models import Word2Vec

# Sample sentences
sentences = [
    ['I', 'love', 'machine', 'learning'],
    ['Word2Vec', 'creates', 'word', 'embeddings'],
    ['Gensim', 'makes', 'training', 'easy'],
    ['I', 'enjoy', 'learning', 'new', 'things']
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1, epochs=10)

# Get vector for word 'learning'
vector = model.wv['learning']

# Find most similar words to 'learning'
similar_words = model.wv.most_similar('learning', topn=3)

print(f"Vector for 'learning' (first 5 values): {vector[:5]}")
print('Top 3 words similar to learning:', similar_words)
Important Notes

Make sure your sentences are tokenized (split into words) before training.

More epochs usually improve embedding quality but take longer to train.

Use model.wv.most_similar(word) to find words close in meaning.
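
most_similar ranks words by cosine similarity between their vectors. A minimal pure-Python version of that ranking (toy 3-dimensional vectors with made-up values; real vectors come from model.wv):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical word vectors for illustration
vecs = {'learning': [0.9, 0.1, 0.0],
        'studying': [0.8, 0.2, 0.1],
        'banana':   [0.0, 0.1, 0.9]}

query = vecs['learning']
ranked = sorted(((w, cosine(query, v)) for w, v in vecs.items() if w != 'learning'),
                key=lambda p: p[1], reverse=True)
print(ranked[0][0])  # studying
```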

Summary

Word2Vec turns words into numbers that keep their meaning.

Gensim makes training Word2Vec easy with simple code.

Try different settings like vector size and window to get better results.