Jaccard similarity in NLP

Introduction

Jaccard similarity helps us measure how much two sets are alike by comparing what they share versus what they have in total.

Common uses:
Comparing how similar two documents are based on the words they contain.
Finding how much overlap exists between two users' interests or preferences.
Checking the similarity between two sets of tags or labels.
Measuring how alike two customer profiles are based on the items they purchased.
Syntax
Jaccard_similarity(A, B) = |A ∩ B| / |A ∪ B|
A and B are sets, like sets of words or items.
The numerator counts items both sets share; the denominator counts all unique items in both.
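As a minimal sketch, the formula can be written with Python's built-in set operators, where `&` is intersection and `|` is union. The function name `jaccard` is chosen here for illustration, and the sketch assumes at least one of the two sets is non-empty so the denominator is never zero.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B| for two sets, assuming their union is non-empty."""
    shared = a & b      # items present in both sets (the numerator)
    combined = a | b    # all unique items across both sets (the denominator)
    return len(shared) / len(combined)


print(jaccard({"red", "green"}, {"green", "blue"}))
```

Here one colour out of three unique colours is shared, so the printed value is one third.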
Examples
The two sets share 2 items (2 and 3) out of 4 unique items in total (1, 2, 3, 4), so the similarity is 0.5.
A = {1, 2, 3}
B = {2, 3, 4}
Jaccard_similarity = 2 / 4 = 0.5
Only 'banana' is shared, and there are 3 unique fruits in total, so the similarity is about 0.333.
A = {'apple', 'banana'}
B = {'banana', 'cherry'}
Jaccard_similarity = 1 / 3 ≈ 0.333
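Both worked examples can be checked with a few lines of Python. This is a sketch rather than a reference implementation: the helper name `jaccard_fraction` is made up for illustration, it uses `fractions.Fraction` so the ratio is shown exactly before rounding, and raising an error for two empty sets is an assumption of this sketch.

```python
from fractions import Fraction


def jaccard_fraction(a: set, b: set) -> Fraction:
    """Exact Jaccard similarity as a fraction (hypothetical helper)."""
    union = a | b
    if not union:
        # Assumption of this sketch: treat 0/0 as an error.
        raise ValueError("Jaccard similarity is undefined for two empty sets")
    return Fraction(len(a & b), len(union))


numbers = jaccard_fraction({1, 2, 3}, {2, 3, 4})
fruits = jaccard_fraction({"apple", "banana"}, {"banana", "cherry"})

print(numbers, float(numbers))          # 1/2 0.5
print(fruits, round(float(fruits), 3))  # 1/3 0.333
```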
Sample Model

This code calculates the Jaccard similarity between two sentences by turning them into sets of words and comparing them.

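def jaccard_similarity(set1, set2):
    """Return |set1 ∩ set2| / |set1 ∪ set2| for two sets."""
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    # Two empty sets have an empty union; return 1.0 to avoid dividing by zero.
    return len(intersection) / len(union) if len(union) > 0 else 1.0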

# Lowercase each sentence and split it on whitespace into a list of words
sentence1 = "I love machine learning".lower().split()
sentence2 = "I enjoy learning about machines".lower().split()

set1 = set(sentence1)
set2 = set(sentence2)

similarity = jaccard_similarity(set1, set2)
print(f"Jaccard similarity: {similarity:.3f}")
Output
Jaccard similarity: 0.286
Important Notes

Jaccard similarity ranges from 0 (no overlap) to 1 (exact match).

It operates on sets, so it ignores word order and repeated words; it is not suited to comparing sequences.

When both sets are empty, the formula gives 0/0, which is undefined; the code here returns 1 by convention.
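
The notes above can be illustrated with a short sketch. The helper names `word_set` and `jaccard` are illustrative, and returning 1.0 for two empty sets is a convention chosen here, not something the formula itself dictates.

```python
def word_set(text: str) -> set:
    """Lowercase a sentence and return its unique whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    union = a | b
    if not union:
        # 0 / 0 is undefined; returning 1.0 treats two empty sets as identical.
        return 1.0
    return len(a & b) / len(union)


# Range: disjoint sets score 0.0, identical sets score 1.0.
disjoint = jaccard(word_set("red green"), word_set("blue yellow"))
identical = jaccard(word_set("red green"), word_set("red green"))

# Order and repetition are ignored: both sentences reduce to the same word set.
reordered = jaccard(word_set("the dog bit the man"), word_set("the man bit the dog"))

# Convention for two empty sets.
both_empty = jaccard(set(), set())

print(disjoint, identical, reordered, both_empty)  # 0.0 1.0 1.0 1.0
```

The reordered pair scores 1.0 even though the two sentences mean different things, which is why the measure suits bag-of-words comparisons and not order-sensitive ones.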

Summary

Jaccard similarity measures how much two sets overlap.

It is the size of the intersection divided by the size of the union.

Useful for comparing text, tags, or any group of items.