How to Calculate Cosine Similarity for Text in NLP

To calculate cosine similarity between texts in NLP, first convert the texts into numeric vectors using a method such as TF-IDF or raw term counts (for example, scikit-learn's TfidfVectorizer or CountVectorizer). Then compute the cosine of the angle between these vectors: the smaller the angle, the more similar the texts are.
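To make "the cosine of the angle" concrete: for two vectors u and v it is the dot product divided by the product of their lengths, cos θ = (u · v) / (‖u‖ ‖v‖). The following is a minimal from-scratch sketch of that formula in plain Python. The function name `cosine` and the sample vectors are illustrative assumptions, not part of any library, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: an empty vector matches nothing
    return dot / (norm_u * norm_v)

# Toy term-count vectors over a three-word vocabulary (made-up values)
doc_a = [1, 1, 0]
doc_b = [0, 1, 1]

print(f"{cosine(doc_a, doc_b):.4f}")      # one shared term out of two each
print(f"{cosine(doc_a, [2, 2, 0]):.4f}")  # same direction, different length
```

The first pair should come out at about 0.5 and the second at about 1.0, which shows that scaling a vector (for example, repeating a document) does not change the score.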

Syntax

To calculate cosine similarity between two text documents, follow these steps:

  • Vectorize texts: Convert the texts into numeric vectors using CountVectorizer or TfidfVectorizer, fitting a single vectorizer on all texts so they share one vocabulary.
  • Calculate cosine similarity: Use the cosine_similarity function from sklearn.metrics.pairwise to get the similarity score.

Example syntax:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# text1 and text2 are placeholders for your own strings
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2])  # one row per text
similarity = cosine_similarity(vectors[0], vectors[1])  # 1x1 array
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text1 = "I love machine learning"
text2 = "Machine learning is great"

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2])
similarity = cosine_similarity(vectors[0], vectors[1])
print(f"{similarity[0][0]:.4f}")
Output
0.4112

Example

This example calculates the cosine similarity between two simple sentences using TfidfVectorizer. It converts the sentences into TF-IDF vectors and then computes the similarity score. Because TF-IDF weights are never negative, the score falls between 0 (no shared terms) and 1 (the same relative term weights).

python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two sample texts
text1 = "I enjoy reading books about AI"
text2 = "Reading about artificial intelligence is fun"

# Convert texts to TF-IDF vectors
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2])

# Calculate cosine similarity
similarity_score = cosine_similarity(vectors[0], vectors[1])

print(f"Cosine similarity: {similarity_score[0][0]:.4f}")
Output
Cosine similarity: 0.2258
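The same approach extends to more than two texts. As a hedged sketch: calling cosine_similarity with a single matrix compares every row with every other row and returns a square matrix of scores. The third sentence below is made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "I enjoy reading books about AI",
    "Reading about artificial intelligence is fun",
    "The weather is sunny today",  # illustrative extra sentence
]

# Fit one vectorizer on all documents so they share a vocabulary
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(docs)

# With a single argument, each row is compared with every row
sim_matrix = cosine_similarity(vectors)

print(sim_matrix.shape)  # one row and one column per document
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(f"doc {i} vs doc {j}: {sim_matrix[i, j]:.4f}")
```

The diagonal of the matrix is each document compared with itself, so it should be 1 (up to floating-point rounding), and the matrix is symmetric, which is why the loop only prints the upper triangle.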

Common Pitfalls

Common mistakes when calculating cosine similarity for text include:

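  • Passing raw text strings to cosine_similarity instead of numeric vectors, which raises an error.
  • Ignoring preprocessing such as lowercasing or removing stopwords, which can change the score. Note that scikit-learn's vectorizers lowercase by default but only remove stopwords when stop_words is set.
  • Confusing cosine similarity with Euclidean distance. Cosine similarity compares the direction of two vectors and ignores their length, while Euclidean distance grows with differences in magnitude such as document length.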

Always ensure texts are properly vectorized and preprocessed before calculating cosine similarity.

python
from sklearn.metrics.pairwise import cosine_similarity

# Wrong: passing raw strings
try:
    cosine_similarity(["text one"], ["text two"])
except Exception as e:
    print(f"Error: {e}")

# Right: vectorize first
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(["text one", "text two"])
sim = cosine_similarity(vectors[0], vectors[1])
print(f"Correct similarity: {sim[0][0]:.4f}")
Output
Error: could not convert string to float: 'text one'
Correct similarity: 0.5000

(The exact error text can differ between scikit-learn and NumPy versions.)
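To illustrate the difference between cosine similarity and Euclidean distance mentioned in the pitfalls, here is a small sketch that compares a short text with a copy of itself repeated three times. The sample strings are made up, and euclidean_distances comes from the same sklearn.metrics.pairwise module.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

short_doc = "machine learning"
long_doc = "machine learning machine learning machine learning"

# Raw term counts: [1, 1] for the short text and [3, 3] for the long one
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform([short_doc, long_doc])

cos_score = cosine_similarity(vectors[0], vectors[1])[0][0]
euc_dist = euclidean_distances(vectors[0], vectors[1])[0][0]

print(f"Cosine similarity:  {cos_score:.4f}")
print(f"Euclidean distance: {euc_dist:.4f}")
```

The two count vectors point in the same direction, so the cosine similarity should be 1 even though the Euclidean distance is well above 0. This is why cosine similarity is usually preferred when documents differ in length.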

Quick Reference

Tips for calculating cosine similarity in NLP:

  • Use TfidfVectorizer or CountVectorizer to convert text to vectors.
  • Preprocess text by lowercasing and removing stopwords for better results.
  • Use cosine_similarity from sklearn.metrics.pairwise to compute similarity.
  • For TF-IDF or count vectors, the similarity score ranges from 0 (no shared terms) to 1 (the same relative term weights).
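The preprocessing tip can be sketched as follows. The helper name `tfidf_similarity` and the two sentences are hypothetical and only serve the illustration. stop_words="english" is a real TfidfVectorizer option that drops scikit-learn's built-in English stopword list, and lowercasing is already on by default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(text_a, text_b, **vectorizer_options):
    """Hypothetical helper: TF-IDF cosine similarity of two strings."""
    vectorizer = TfidfVectorizer(**vectorizer_options)
    vectors = vectorizer.fit_transform([text_a, text_b])
    return cosine_similarity(vectors[0], vectors[1])[0][0]

text_a = "The cat sat on the mat"
text_b = "The dog is in the house"

# Default settings keep "the", so the texts look somewhat similar
default_score = tfidf_similarity(text_a, text_b)

# Removing English stopwords leaves no shared terms
filtered_score = tfidf_similarity(text_a, text_b, stop_words="english")

print(f"Default settings:   {default_score:.4f}")
print(f"Stop words removed: {filtered_score:.4f}")
```

With the defaults, the only overlap between the two sentences is the word "the", which still produces a score above 0. Once stopwords are removed the score should drop to 0, which better reflects that the sentences have no content words in common.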

Key Takeaways

  • Convert text to numeric vectors using TF-IDF or count vectorization before calculating similarity.
  • Use cosine similarity to measure how close two text vectors are by the angle between them.
  • Preprocess text (lowercase, remove stopwords) so the score reflects meaningful words.
  • Never pass raw text strings directly to cosine similarity functions.
  • For TF-IDF or count vectors, cosine similarity ranges from 0 (no shared terms) to 1 (the same relative term weights).