How to Use Sentence Transformers in Python for NLP Tasks
Use the sentence-transformers Python library to convert sentences into vectors: load a pre-trained model with SentenceTransformer and call encode() on your text. The resulting vectors can then be used for tasks like semantic search, clustering, or classification.
Syntax
The main steps to use sentence transformers in Python are:
- from sentence_transformers import SentenceTransformer: Import the model class.
- model = SentenceTransformer('model_name'): Load a pre-trained model by name.
- embeddings = model.encode(sentences): Convert one or more sentences into vector embeddings.
These embeddings are numerical arrays representing the meaning of sentences.
python
from sentence_transformers import SentenceTransformer

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode sentences to get embeddings
sentences = ['This is an example sentence', 'Each sentence is converted']
embeddings = model.encode(sentences)

print(embeddings.shape)
Output
(2, 384)
Example
This example shows how to load a model, encode sentences, and compare two of them using cosine similarity.
python
from sentence_transformers import SentenceTransformer, util

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentences to encode
sentences = ['I love machine learning', 'I enjoy studying AI', 'The weather is sunny today']

# Get embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute cosine similarity between first two sentences
similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1])

print(f"Similarity between sentence 1 and 2: {similarity.item():.4f}")
Output
Similarity between sentence 1 and 2: 0.7924
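To make the score above less of a black box, here is a minimal sketch of what a cosine similarity computes and how it can rank sentences against a query, which is the core idea behind semantic search. It uses only the standard library. The three-dimensional vectors are made up for illustration and are not real model output, and the names cosine_similarity and toy_embeddings are hypothetical, not part of sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for illustration only;
# real all-MiniLM-L6-v2 vectors have 384 dimensions.
toy_embeddings = {
    "I love machine learning": [0.9, 0.1, 0.2],
    "I enjoy studying AI": [0.8, 0.2, 0.3],
    "The weather is sunny today": [0.1, 0.9, 0.1],
}

query = "I love machine learning"

# Score every other sentence against the query, highest score first
ranked = sorted(
    (
        (cosine_similarity(toy_embeddings[query], vector), text)
        for text, vector in toy_embeddings.items()
        if text != query
    ),
    reverse=True,
)

for score, text in ranked:
    print(f"{score:.4f}  {text}")
```

With these toy vectors the AI sentence ranks above the weather sentence, mirroring what you would expect from real embeddings of the same sentences. In practice the library's own similarity utilities do this work on tensors and are the better choice.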
Common Pitfalls
Common mistakes when using sentence transformers include:
- Not installing the sentence-transformers package before use.
- Passing a single sentence as a string instead of a list to encode(), which returns a 1D vector instead of a 2D array.
- Forgetting to convert embeddings to tensors when using similarity functions that require tensors.
- Using very large models unnecessarily, which slows down encoding.
python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Wrong: passing string instead of list
# embeddings = model.encode('This is a sentence')  # This works but returns 1D array

# Right: pass list even for one sentence
embeddings = model.encode(['This is a sentence'])
print(embeddings.shape)  # (1, 384)
Output
(1, 384)
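One way to avoid the string-versus-list pitfall is to normalize the input before it reaches encode(). The sketch below assumes a small helper of your own; as_sentence_list is a hypothetical name, not a function provided by sentence-transformers.

```python
def as_sentence_list(sentences):
    """Wrap a lone string in a list; return any other iterable as a list."""
    # A str is itself iterable, so it must be checked first,
    # otherwise list("abc") would split it into characters.
    if isinstance(sentences, str):
        return [sentences]
    return list(sentences)

print(as_sentence_list("This is a sentence"))  # ['This is a sentence']
print(as_sentence_list(("First", "Second")))   # ['First', 'Second']
```

Calling model.encode(as_sentence_list(text)) should then give a 2D result whether text is one sentence or several.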
Quick Reference
| Function/Method | Purpose |
|---|---|
| SentenceTransformer('model_name') | Load a pre-trained sentence transformer model |
| model.encode(sentences) | Convert sentences (list) to vector embeddings |
| util.pytorch_cos_sim(vec1, vec2) | Calculate cosine similarity between two vectors |
| convert_to_tensor=True | Option in encode() to get PyTorch tensors for similarity |
| model.encode(['sentence']) | Encode a single sentence as a list to get 2D array |
Key Takeaways
Load a pre-trained model with SentenceTransformer('model_name') before encoding.
Always pass a list of sentences to model.encode() for consistent output shape.
Use embeddings for tasks like similarity by computing cosine similarity between vectors.
Install the sentence-transformers package via pip before using it.
Choose smaller models like 'all-MiniLM-L6-v2' for faster encoding with good accuracy.
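If you are unsure whether the package is installed, a quick check like the following can give a clearer message than an ImportError. This is a minimal sketch using only the standard library; is_installed is a hypothetical helper name. Note that the pip package name uses a hyphen (sentence-transformers) while the import name uses an underscore (sentence_transformers).

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("sentence_transformers"):
    print("sentence-transformers is available")
else:
    print("Missing package; install it with: pip install sentence-transformers")
```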
