How to Calculate Cosine Similarity for Text in NLP
To calculate cosine similarity between texts in NLP, first convert texts into numeric vectors using methods like TF-IDF or CountVectorizer. Then compute the cosine similarity by measuring the cosine of the angle between these vectors, which shows how similar the texts are.
Syntax
To calculate cosine similarity between two text documents, follow these steps:
- Vectorize texts: Convert text into numeric vectors using CountVectorizer or TfidfVectorizer.
- Calculate cosine similarity: Use the cosine_similarity function from sklearn.metrics.pairwise to get the similarity score.
Example syntax:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2])
similarity = cosine_similarity(vectors[0], vectors[1])
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text1 = "I love machine learning"
text2 = "Machine learning is great"

# The default tokenizer lowercases text and drops one-character tokens such as "I"
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2])

# cosine_similarity returns a 1x1 array when given two single-row matrices
similarity = cosine_similarity(vectors[0], vectors[1])
print(f"{similarity[0][0]:.4f}")
Output
0.4112
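To make the "cosine of the angle" concrete, the sketch below recomputes the score by hand as cos(theta) = (a . b) / (||a|| * ||b||) and compares it with cosine_similarity. It assumes raw term counts from CountVectorizer rather than TF-IDF weights so the arithmetic is easy to follow; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["I love machine learning", "Machine learning is great"]

# Raw term counts as a dense array (fine for tiny inputs like these).
counts = CountVectorizer().fit_transform(texts).toarray()
a, b = counts[0], counts[1]

# cos(theta) = (a . b) / (||a|| * ||b||)
manual = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The library call expects 2D inputs, so slice out one row each.
library = float(cosine_similarity(counts[:1], counts[1:2])[0, 0])

print(f"manual:  {manual:.4f}")   # manual:  0.5774
print(f"library: {library:.4f}")  # library: 0.5774
```

The two texts share two words ("machine", "learning") out of three and four counted words respectively, so the score is 2 / (sqrt(3) * sqrt(4)), roughly 0.5774.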
Example
This example shows how to calculate cosine similarity between two simple sentences using TfidfVectorizer. It converts the sentences into vectors and then computes the cosine similarity score. Because TF-IDF weights are never negative, the score ranges from 0 (no shared terms) to 1 (vectors pointing in the same direction, as for identical texts).
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two sample texts
text1 = "I enjoy reading books about AI"
text2 = "Reading about artificial intelligence is fun"

# Convert texts to TF-IDF vectors
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2])

# Calculate cosine similarity
# Only "reading" and "about" overlap: "AI" and "artificial intelligence" are different tokens
similarity_score = cosine_similarity(vectors[0], vectors[1])
print(f"Cosine similarity: {similarity_score[0][0]:.4f}")
Output
Cosine similarity: 0.2258
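As a quick sanity check on the 0-to-1 range, the following sketch compares a document with an identical copy and with a document that shares no words. The sample sentences are made up for illustration. It also passes the whole matrix to cosine_similarity, which returns every pairwise score at once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "deep learning models",     # reference document
    "deep learning models",     # identical to the reference
    "fresh garden vegetables",  # shares no words with the reference
]

vectors = TfidfVectorizer().fit_transform(docs)

# With a single argument, the result is an n x n matrix of pairwise scores.
scores = cosine_similarity(vectors)

print(f"identical texts: {scores[0, 1]:.4f}")  # identical texts: 1.0000
print(f"no shared words: {scores[0, 2]:.4f}")  # no shared words: 0.0000
```

Entry scores[i, j] is the similarity between document i and document j, so the diagonal is always 1 for non-empty documents.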
Common Pitfalls
Common mistakes when calculating cosine similarity for text include:
- Passing raw text strings to the similarity function instead of numeric vectors, which raises an error.
- Overlooking preprocessing. TfidfVectorizer and CountVectorizer lowercase text by default, but they remove stopwords only if stop_words is set, and shared stopwords can inflate the score.
- Confusing cosine similarity with Euclidean distance; they measure different things.
Always ensure texts are properly vectorized and preprocessed before calculating cosine similarity.
python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Wrong: passing raw strings
try:
    cosine_similarity(["text one"], ["text two"])
except Exception as e:
    print(f"Error: {e}")

# Right: vectorize first
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(["text one", "text two"])
sim = cosine_similarity(vectors[0], vectors[1])
print(f"Correct similarity: {sim[0][0]:.4f}")
Output (the wording of the error varies across scikit-learn and NumPy versions)
Error: could not convert string to float: 'text one'
Correct similarity: 0.5000
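The difference between cosine similarity and Euclidean distance is easiest to see with a text and a repeated copy of it. In this sketch (sample strings chosen for illustration), the count vectors point in the same direction but have different lengths, so the cosine similarity is 1 while the Euclidean distance is not 0.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

short_text = "machine learning"
long_text = "machine learning machine learning"  # same words, repeated twice

# Count vectors are [1, 1] and [2, 2]: same direction, different magnitude.
counts = CountVectorizer().fit_transform([short_text, long_text])

cos = cosine_similarity(counts[0], counts[1])[0, 0]
dist = euclidean_distances(counts[0], counts[1])[0, 0]

print(f"cosine similarity:  {cos:.4f}")   # cosine similarity:  1.0000
print(f"euclidean distance: {dist:.4f}")  # euclidean distance: 1.4142
```

Cosine similarity ignores vector length, which is often what you want when documents differ in size; Euclidean distance does not.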
Quick Reference
Tips for calculating cosine similarity in NLP:
- Use TfidfVectorizer or CountVectorizer to convert text to vectors.
- Preprocess text by lowercasing and removing stopwords for better results.
- Use cosine_similarity from sklearn.metrics.pairwise to compute similarity.
- Similarity score ranges from 0 (no similarity) to 1 (identical).
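The effect of stopword removal can be sketched with two sentences that share only function words. The helper tfidf_similarity below is hypothetical (it is not part of scikit-learn) and simply forwards keyword arguments to TfidfVectorizer; stop_words="english" selects scikit-learn's built-in English stopword list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["The cat is on the mat", "The dog is in the yard"]

def tfidf_similarity(docs, **vectorizer_kwargs):
    """Hypothetical helper: TF-IDF cosine similarity of the first two docs."""
    vectors = TfidfVectorizer(**vectorizer_kwargs).fit_transform(docs)
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

# Default settings keep "the" and "is", the only words the sentences share.
with_stopwords = tfidf_similarity(texts)

# With stopwords removed, only cat/mat and dog/yard remain, with no overlap.
without_stopwords = tfidf_similarity(texts, stop_words="english")

print(f"with stopwords:    {with_stopwords:.4f}")
print(f"without stopwords: {without_stopwords:.4f}")  # without stopwords: 0.0000
```

The first score is clearly above zero even though the sentences are about different things, while the second drops to zero once the shared stopwords are gone.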
Key Takeaways
Convert text to numeric vectors using TF-IDF or count vectorization before similarity calculation.
Use cosine similarity to measure how close two text vectors are by the angle between them.
Preprocess text (lowercase, remove stopwords) to improve similarity accuracy.
Never pass raw text strings directly to cosine similarity functions.
For TF-IDF and count vectors, cosine similarity values range from 0 (no shared terms) to 1 (identical).
