How to Use Word Embeddings in Python for NLP Tasks
Use word embeddings in Python by loading pre-trained models such as Word2Vec or GloVe with a library like gensim. These embeddings convert words into vectors that capture meaning, enabling NLP tasks such as similarity search and classification.
Syntax
To use word embeddings in Python, you typically load a pre-trained embedding model and then get vector representations for words.
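- gensim.models.KeyedVectors.load_word2vec_format(): Load pre-trained embeddings stored in word2vec format. GloVe text files lack the header line this format expects, so convert them first or, in gensim 4.0 and later, pass no_header=True.
- model[word]: Access the vector for a specific word.
- model.similarity(word1, word2): Compute the cosine similarity between two words.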
python
from gensim.models import KeyedVectors

# Load pre-trained embeddings (Google News Word2Vec format)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Get vector for a word
vector = model['king']

# Compute similarity between two words
similarity = model.similarity('king', 'queen')
Example
This example trains a small Word2Vec model on gensim's bundled sample corpus (common_texts), retrieves a word vector, and finds the most similar words.
python
from gensim.models import Word2Vec
from gensim.test.utils import common_texts

# Train a small Word2Vec model on sample data
model = Word2Vec(sentences=common_texts, vector_size=50, window=3, min_count=1, workers=1)

# Get vector for a word
vector = model.wv['computer']
print('Vector for "computer":', vector[:5])  # print first 5 values

# Find most similar words to 'computer'
similar_words = model.wv.most_similar('computer', topn=3)
print('Words similar to "computer":', similar_words)
Output
Vector for "computer": [ 0.00151203 -0.00234842 0.00279991 0.00252249 -0.00299346]
Words similar to "computer": [('system', 0.253), ('graph', 0.204), ('user', 0.198)]
Exact values depend on the gensim version and random seed, and the similarity scores are shown rounded. A model trained on this little text produces low, noisy scores.
Common Pitfalls
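- Loading large pre-trained embeddings can exhaust available memory and crash the process. Passing limit=N to load_word2vec_format() reads only the first N entries, which in most pre-trained files are the most frequent words.
- Looking up an out-of-vocabulary word with model[word] raises a KeyError in gensim.
- Lookups are exact string matches, so the casing and tokenization of your text must match the model's vocabulary. For example, 'King' and 'king' are different keys.
- Comparing raw vectors with a plain dot product mixes vector length into the score. model.similarity() uses cosine similarity, which normalizes the vectors for you; normalize manually if you compute similarities yourself.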
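
The normalization pitfall above can be made concrete with a few lines of standard-library Python. This is a minimal sketch, not gensim's implementation: the 2-dimensional vectors are made up, and the helper names dot and cosine_similarity are chosen here for illustration. It shows a case where the raw dot product and cosine similarity rank two candidates in opposite order.

```python
import math


def dot(a, b):
    """Plain dot product; grows with the length of either vector."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by both vector lengths, so only direction matters."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors: 'close_short' points almost the same way as 'query' but is short;
# 'far_long' points 45 degrees away from 'query' but is much longer.
query = [1.0, 0.0]
close_short = [0.9, 0.1]
far_long = [5.0, 5.0]

print("dot:   ", dot(query, close_short), dot(query, far_long))
print("cosine:", cosine_similarity(query, close_short), cosine_similarity(query, far_long))
```

The dot product favors far_long purely because of its length, while cosine similarity favors close_short, the vector that actually points in nearly the same direction.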
Always check if a word exists in the model before accessing its vector.
python
# Assumes `model` is the KeyedVectors object loaded in the Syntax section

# Wrong way: access the vector without checking
# vector = model['unknownword']  # Raises KeyError if the word is not in the vocabulary

# Right way: check first
word = 'unknownword'
if word in model:
    vector = model[word]
else:
    vector = None  # Handle the missing word gracefully
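
The same membership check extends naturally to whole sentences, for instance when building features for a classifier. The sketch below is one possible approach, not a gensim API: the helper average_vector and the toy dictionary are hypothetical, and the dictionary stands in for a loaded model because both support `word in model` and `model[word]`.

```python
def average_vector(tokens, vectors):
    """Average the vectors of known tokens; return None if no token is known.

    `vectors` is anything supporting `token in vectors` and `vectors[token]`,
    such as a gensim KeyedVectors object or a plain dict.
    """
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    dimensions = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dimensions)]


# Hypothetical toy vocabulary standing in for a loaded model
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

print(average_vector(["good", "movie", "unknownword"], toy_vectors))  # [0.5, 0.5]
print(average_vector(["unknownword"], toy_vectors))                   # None
```

Skipping unknown words keeps one rare token from discarding the whole sentence, and returning None lets the caller decide how to handle text with no known words at all.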
Quick Reference
| Function/Method | Purpose |
|---|---|
| load_word2vec_format(path, binary=True/False) | Load pre-trained embeddings from file |
| model[word] | Get vector for a word |
| model.similarity(word1, word2) | Compute similarity between two words |
| model.most_similar(word, topn=5) | Find words most similar to given word |
| word in model | Check if word exists in embeddings vocabulary |
Key Takeaways
- Load pre-trained word embeddings using gensim for easy access to word vectors.
- Always check if a word exists in the model before using its vector to avoid errors.
- Word embeddings convert words into numbers that capture meaning for NLP tasks.
- Small custom models can be trained quickly on your own text using gensim's Word2Vec.
- Use similarity and most_similar methods to explore relationships between words.
