
How to Use Word Embeddings in Python for NLP Tasks

Use word embeddings in Python by loading pre-trained models like Word2Vec or GloVe with libraries such as gensim. These embeddings convert words into vectors that capture meaning, enabling NLP tasks like similarity or classification.
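
To make "vectors that capture meaning" concrete, here is a minimal sketch of the cosine similarity measure that embedding libraries typically use to compare words. The three-dimensional vectors are hypothetical values invented for illustration, not output from any real model, and `cosine_similarity` is a helper written for this sketch.

```python
import math

# Hypothetical 3-dimensional toy vectors, invented for illustration only.
# Real embeddings are learned from text and usually have 50 to 300 dimensions.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

# Vectors pointing in similar directions score close to 1
print(f"king vs queen: {king_queen:.3f}")
print(f"king vs apple: {king_apple:.3f}")
```

Because the measure divides by both vector lengths, scaling a vector does not change its score; only its direction matters.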

Syntax

To use word embeddings in Python, you typically load a pre-trained embedding model and then get vector representations for words.

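  • gensim.models.KeyedVectors.load_word2vec_format(): Load pre-trained embeddings stored in the word2vec file format. GloVe text files have no header line, so in gensim 4.x pass no_header=True (or convert them to word2vec format first).
  • model[word]: Access the vector for a specific word.
  • model.similarity(word1, word2): Compute the cosine similarity between two words.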
python
from gensim.models import KeyedVectors

# Load pre-trained embeddings (Google News Word2Vec format)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Get vector for a word
vector = model['king']

# Compute similarity between two words
similarity = model.similarity('king', 'queen')

Example

This example trains a small Word2Vec model on gensim's built-in sample corpus, retrieves a word vector, and finds similar words. Because the corpus is tiny, the similarity scores are illustrative only.

python
from gensim.models import Word2Vec
from gensim.test.utils import common_texts

# Train a small Word2Vec model on sample data
model = Word2Vec(sentences=common_texts, vector_size=50, window=3, min_count=1, workers=1)

# Get vector for a word
vector = model.wv['computer']
print('Vector for "computer":', vector[:5])  # print first 5 values

# Find most similar words to 'computer'
similar_words = model.wv.most_similar('computer', topn=3)
print('Words similar to "computer":', similar_words)
Output
Vector for "computer": [ 0.00151203 -0.00234842 0.00279991 0.00252249 -0.00299346]
Words similar to "computer": [('system', 0.253), ('graph', 0.204), ('user', 0.198)]

Common Pitfalls

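  • Large pre-trained files are loaded fully into RAM (the Google News vectors need several gigabytes), so loading them without enough memory can crash the process.
  • Looking up an out-of-vocabulary word in Word2Vec or GloVe vectors raises a KeyError.
  • Lookups require an exact match with a vocabulary key: case and punctuation matter, so 'King' and 'king' are different keys.
  • Comparing raw vectors with a plain dot product, without normalizing them, can give misleading results. model.similarity() computes cosine similarity, which normalizes for you.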

Always check if a word exists in the model before accessing its vector.

python
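from gensim.models import KeyedVectors

# Load the vectors first (same file as in the Syntax section)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Wrong way: accessing a vector without checking
# vector = model['unknownword']  # Raises KeyError if the word is not in the vocabulary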

# Right way: Check first
word = 'unknownword'
if word in model:
    vector = model[word]
else:
    vector = None  # Handle missing word gracefully

Quick Reference

Function/Method                                  Purpose
load_word2vec_format(path, binary=True/False)    Load pre-trained embeddings from file
model[word]                                      Get vector for a word
model.similarity(word1, word2)                   Compute similarity between two words
model.most_similar(word, topn=5)                 Find words most similar to given word
word in model                                    Check if word exists in embeddings vocabulary

Key Takeaways

  • Load pre-trained word embeddings using gensim for easy access to word vectors.
  • Always check if a word exists in the model before using its vector to avoid errors.
  • Word embeddings convert words into numbers that capture meaning for NLP tasks.
  • Small custom models can be trained quickly on your own text using gensim's Word2Vec.
  • Use similarity and most_similar methods to explore relationships between words.