
Cosine similarity in NLP

Introduction
Cosine similarity measures how alike two items are by looking at the angle between their feature vectors, ignoring the vectors' magnitude.
Comparing how similar two documents or sentences are in meaning.
Finding similar users or items in recommendation systems.
Grouping similar images or texts together.
Checking if two pieces of text talk about the same topic.
Measuring similarity between word vectors in language models.
Syntax
cosine_similarity = (A · B) / (||A|| * ||B||)

Where:
- A and B are vectors
- · means dot product
- ||A|| means length (magnitude) of vector A
Vectors A and B must have the same number of features.
Cosine similarity ranges from -1 (opposite directions) to 1 (same direction), with 0 meaning the vectors are perpendicular. For non-negative data such as word counts or TF-IDF weights, values always fall between 0 and 1.
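The formula and the range described above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `cosine_sim` and the choice to raise `ValueError` for mismatched lengths or zero vectors are assumptions made for this example.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length numeric sequences (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    dot = sum(x * y for x, y in zip(a, b))          # A · B
    norm_a = math.sqrt(sum(x * x for x in a))       # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))       # ||B||
    if norm_a == 0 or norm_b == 0:
        # The formula divides by the lengths, so a zero vector has no defined angle.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same = cosine_sim([1, 2], [2, 4])         # same direction, close to 1
orthogonal = cosine_sim([1, 0], [0, 1])   # perpendicular, 0
opposite = cosine_sim([1, 2], [-1, -2])   # opposite direction, close to -1
print(same, orthogonal, opposite)
```

Because of floating-point rounding, the first and last results may differ from 1 and -1 in the last decimal place, so compare them with a tolerance rather than with `==`.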
Examples
Calculates cosine similarity between two simple 3D vectors.
A = [1, 0, 1]
B = [0, 1, 1]
cosine_similarity = (1*0 + 0*1 + 1*1) / (sqrt(1**2+0**2+1**2) * sqrt(0**2+1**2+1**2)) = 1 / (sqrt(2)*sqrt(2)) = 0.5
Vectors pointing in the same direction have cosine similarity 1.
A = [2, 3]
B = [4, 6]
cosine_similarity = (2*4 + 3*6) / (sqrt(2**2+3**2) * sqrt(4**2+6**2)) = 26 / (sqrt(13)*sqrt(52)) = 1.0
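The two hand calculations above can be checked with NumPy. This is a small sketch; the helper name `cosine_sim_np` is made up for this example, while `np.dot` and `np.linalg.norm` are standard NumPy functions.

```python
import numpy as np

def cosine_sim_np(a, b):
    """Cosine similarity via NumPy: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

first = cosine_sim_np([1, 0, 1], [0, 1, 1])   # the 3D example
second = cosine_sim_np([2, 3], [4, 6])        # B is A scaled by 2
print(round(first, 4), round(second, 4))      # prints: 0.5 1.0
```

The second pair shows why scaling does not matter: B is simply 2 × A, so both vectors point the same way and the similarity is 1 regardless of their lengths.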
Sample Model
This code uses sklearn to find cosine similarity between two vectors representing text features.
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Two example text vectors (e.g., TF-IDF or embeddings)
vector1 = np.array([[1, 2, 3, 4]])
vector2 = np.array([[4, 3, 2, 1]])

# Calculate cosine similarity
similarity = cosine_similarity(vector1, vector2)

print(f"Cosine similarity: {similarity[0][0]:.4f}")
Output
Cosine similarity: 0.6667
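The same idea extends from one pair of vectors to ranking several candidates against a query, as in the recommendation use case from the introduction. The sketch below uses only NumPy; the vectors are invented for illustration, with item 0 reusing `vector2` from the sample and the query reusing `vector1`.

```python
import numpy as np

# Hypothetical feature vectors: one query and three candidate items.
query = np.array([1.0, 2.0, 3.0, 4.0])
items = np.array([
    [4.0, 3.0, 2.0, 1.0],   # item 0: same values as vector2 above
    [2.0, 4.0, 6.0, 8.0],   # item 1: the query scaled by 2
    [1.0, 0.0, 0.0, 0.0],   # item 2: overlaps with the query in one feature only
])

# Normalise every vector to unit length; after that, a plain dot product
# is exactly the cosine similarity.
query_unit = query / np.linalg.norm(query)
items_unit = items / np.linalg.norm(items, axis=1, keepdims=True)

scores = items_unit @ query_unit        # one similarity score per item
ranking = np.argsort(scores)[::-1]      # item indices, most similar first
print(ranking, np.round(scores, 4))
```

Item 1 ranks first because it points in the same direction as the query, even though it is twice as long. Normalising once and reusing the unit vectors is a common design choice when many comparisons are made against the same set of items.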
Important Notes
Cosine similarity ignores the length of vectors, focusing on direction.
It works well when magnitude differences are not important.
For text, vectors often come from word counts, TF-IDF, or embeddings.
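To make these notes concrete, here is a rough sketch that builds word-count vectors from raw text using only the standard library and compares them. The sentences, the whitespace tokenisation, and the helper names `count_vector` and `cosine_sim` are all assumptions for this example; real pipelines usually rely on a proper tokenizer and TF-IDF or embeddings.

```python
import math
from collections import Counter

def count_vector(text, vocabulary):
    """Turn a text into a list of word counts, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_sim(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_a = "the cat sat on the mat"
doc_b = "the cat sat on the mat the cat sat on the mat"   # doc_a repeated twice
doc_c = "stock prices fell sharply today"                 # no words in common with doc_a

# Shared vocabulary so that all vectors have the same number of features.
vocabulary = sorted(set((doc_a + " " + doc_b + " " + doc_c).lower().split()))
vec_a, vec_b, vec_c = (count_vector(d, vocabulary) for d in (doc_a, doc_b, doc_c))

sim_ab = cosine_sim(vec_a, vec_b)   # close to 1: same word proportions, different length
sim_ac = cosine_sim(vec_a, vec_c)   # 0: no shared words
print(sim_ab, sim_ac)
```

Repeating a document doubles every count but leaves the direction of its vector unchanged, which is why `doc_a` and `doc_b` score as practically identical.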
Summary
Cosine similarity measures how closely two vectors point in the same direction.
It is useful for comparing texts, users, or items based on features.
Values range from -1 (opposite directions) to 1 (same direction); a value of 0 means the vectors are perpendicular and share nothing in common.