Jaccard similarity measures how much two sets overlap. It is the number of items the sets share divided by the number of unique items across both sets. The metric is useful for comparing texts represented as sets of words, because it gives the fraction of all distinct words that appear in both texts.
Jaccard similarity in NLP - Model Metrics & Evaluation
Jaccard similarity is not based on a confusion matrix but on set operations. Imagine two sets A and B:
A = {apple, banana, cherry}
B = {banana, cherry, date, fig}
Intersection (A ∩ B) = {banana, cherry} (2 items)
Union (A ∪ B) = {apple, banana, cherry, date, fig} (5 items)
Jaccard similarity = |A ∩ B| / |A ∪ B| = 2 / 5 = 0.4
This means 40% of the combined items are shared.
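The calculation above can be sketched in Python with built-in sets. The function name `jaccard` and the choice to return 1.0 for two empty sets are assumptions made for this illustration, not part of the definition given here.

```python
def jaccard(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b| for two sets."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


A = {"apple", "banana", "cherry"}
B = {"banana", "cherry", "date", "fig"}

score = jaccard(A, B)
print(score)  # 0.4
```

The empty-set guard only avoids a division by zero; some applications may prefer to return 0.0 or raise an error instead.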
Jaccard similarity balances both shared and total items, unlike precision or recall alone. For example:
- If you only care about how many shared items are in one set (like precision), you might miss how big the other set is.
- If you only care about how many shared items cover the other set (like recall), you might ignore extra items in the first set.
Jaccard similarity combines these views by dividing the shared items by the total unique items, giving a balanced similarity score.
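To make the contrast concrete, here is a small sketch that computes the two one-sided views next to the Jaccard score for the same example sets. The variable names `share_of_a` and `share_of_b` are hypothetical; which of them plays the role of precision or recall depends on which set you treat as the prediction.

```python
A = {"apple", "banana", "cherry"}
B = {"banana", "cherry", "date", "fig"}

shared = A & B

# One-sided views: each ignores the size of the other set.
share_of_a = len(shared) / len(A)   # 2 of A's 3 items are shared
share_of_b = len(shared) / len(B)   # 2 of B's 4 items are shared

# Jaccard divides by the union, so extra items in either set lower the score.
jaccard_score = len(shared) / len(A | B)  # 2 of 5 unique items are shared

print(share_of_a, share_of_b, jaccard_score)
```

Because the union is at least as large as either set, the Jaccard score can never exceed either one-sided ratio.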
Jaccard similarity ranges from 0 to 1:
- Good (close to 1): Sets share most items. For example, 0.8 means 80% overlap, showing strong similarity.
- Bad (close to 0): Sets share very few or no items. For example, 0.1 means only 10% overlap, showing weak similarity.
In text comparison, a high Jaccard score means the texts use many of the same words.
- Ignoring set size differences: Small sets can have high similarity by chance.
- Not preprocessing text: Different word forms or cases can lower similarity unfairly.
- Overlooking stop words: Common words like "the" or "and" can inflate similarity.
- Using Jaccard for ordered data: It ignores order, so it may miss important differences.
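Two of these pitfalls are easy to reproduce. The sketch below assumes simple whitespace tokenization, and the helper names `tokens` and `jaccard` are hypothetical. It shows how letter case lowers the score and how word order has no effect at all.

```python
def jaccard(a: set, b: set) -> float:
    union = a | b
    # Assumption: two empty sets count as identical.
    return len(a & b) / len(union) if union else 1.0


def tokens(text: str, lowercase: bool = True) -> set:
    """Split on whitespace and optionally lowercase each word."""
    words = text.split()
    return {w.lower() for w in words} if lowercase else set(words)


# Case differences: "The" and "the" count as different items without lowercasing.
raw = jaccard(tokens("The Cat sat", lowercase=False),
              tokens("the cat sat", lowercase=False))
normalized = jaccard(tokens("The Cat sat"), tokens("the cat sat"))

# Word order: both sentences produce the same set of words.
reordered = jaccard(tokens("dog bites man"), tokens("man bites dog"))

print(raw, normalized, reordered)  # 0.2 1.0 1.0
```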
Your model compares two documents and gets a Jaccard similarity of 0.98. Is this always good? Not necessarily. If the documents are very short or contain many common stop words, the high score might be misleading. Always check the context and preprocess text well.
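The effect of stop words on short documents can be seen in a small sketch. The stop-word list here is a tiny hypothetical one chosen for the example; real projects usually rely on a larger curated list.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "on"}


def jaccard(a: set, b: set) -> float:
    union = a | b
    # Assumption: two empty sets count as identical.
    return len(a & b) / len(union) if union else 1.0


def word_set(text: str, drop_stop_words: bool = False) -> set:
    words = set(text.lower().split())
    return words - STOP_WORDS if drop_stop_words else words


doc1 = "the cat is on the mat"
doc2 = "the dog is on the bed"

with_stop = jaccard(word_set(doc1), word_set(doc2))
without_stop = jaccard(word_set(doc1, True), word_set(doc2, True))

print(round(with_stop, 3), without_stop)  # 0.429 0.0
```

The two sentences share no content words, yet the stop words alone account for 3 of the 7 unique words before filtering.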
Practice
Solution
Step 1: Understand the definition of Jaccard similarity
Jaccard similarity is defined as the size of the intersection of two sets divided by the size of their union.
Step 2: Compare options with the definition
The size of the intersection divided by the size of the union matches the definition exactly, while the other options describe different calculations.
Final Answer:
The size of the intersection divided by the size of the union -> Option D
Quick Check:
Jaccard similarity = intersection / union [OK]
- Confusing union with intersection
- Using subtraction instead of division
- Mixing up numerator and denominator
A and B?
Solution
Step 1: Identify set operations for intersection and union
In Python, & is intersection and | is union for sets.
Step 2: Check the formula for Jaccard similarity
Jaccard similarity = size of intersection / size of union, which matches len(A & B) / len(A | B).
Final Answer:
len(A & B) / len(A | B) -> Option B
Quick Check:
Intersection & union operators used correctly [OK]
- Swapping intersection and union operators
- Using subtraction instead of intersection
- Adding lengths instead of dividing
A = {'apple', 'banana', 'cherry'} and B = {'banana', 'cherry', 'date', 'fig'}, what is the Jaccard similarity computed by this code?
len(A & B) / len(A | B)
Solution
Step 1: Calculate intersection and union of sets A and B
Intersection: {'banana', 'cherry'} has 2 elements.
Union: {'apple', 'banana', 'cherry', 'date', 'fig'} has 5 elements.
Step 2: Compute Jaccard similarity
Similarity = 2 / 5 = 0.4.
Final Answer:
0.4 -> Option A
Quick Check:
2 / 5 = 0.4 [OK]
- Counting union incorrectly
- Using addition instead of division
- Mixing up intersection and union counts
A and B. What is the error?
def jaccard(A, B):
    return len(A & B) / len(A & B | B)
Solution
Step 1: Analyze the denominator expression
The denominator is len(A & B | B). In Python, & binds more tightly than |, so A & B is evaluated first and the result is then unioned with B. Because A & B is a subset of B, the whole expression equals B, so the denominator is len(B) rather than the size of the union of A and B.
Step 2: Correct denominator for union
The denominator should be len(A | B) only. The current expression does not compute the union of A and B.
Final Answer:
Incorrect use of union and intersection operators in denominator -> Option C
Quick Check:
Union must be A | B, not combined with & [OK]
- Confusing operator precedence
- Using intersection inside union calculation
- Not testing code before use
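Running the buggy function next to a corrected one makes the precedence problem visible. The names `jaccard_buggy` and `jaccard_fixed` are hypothetical labels for this sketch, and the sets are the fruit example used earlier.

```python
def jaccard_buggy(A, B):
    # & binds tighter than |, so the denominator is (A & B) | B, which is just B.
    return len(A & B) / len(A & B | B)


def jaccard_fixed(A, B):
    # The denominator is the union of both sets.
    return len(A & B) / len(A | B)


A = {"apple", "banana", "cherry"}
B = {"banana", "cherry", "date", "fig"}

buggy = jaccard_buggy(A, B)   # 2 / len(B) = 2 / 4
fixed = jaccard_fixed(A, B)   # 2 / len(A | B) = 2 / 5

print(buggy, fixed)  # 0.5 0.4
```

The buggy version silently returns a plausible-looking number, which is why a quick check against a hand-computed example is worthwhile.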
Solution
Step 1: Calculate initial Jaccard similarity
Intersection = 30
Union = 100 + 80 - 30 = 150
Similarity = 30 / 150 = 0.2
Step 2: Calculate similarity after adding 20 common words
New intersection = 30 + 20 = 50
New union = (100 + 20) + (80 + 20) - 50 = 170
Similarity = 50 / 170 ≈ 0.2941, approximately 0.3
Final Answer:
Initial similarity 0.2; after adding common words similarity increases to 0.3 -> Option A
Quick Check:
Adding common words increases intersection and similarity [OK]
- Forgetting to subtract intersection in union
- Not updating intersection after adding words
- Assuming similarity stays constant
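When only the set sizes are known, the union follows from inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|. The sketch below reproduces the arithmetic from the solution above; the function name `jaccard_from_sizes` is hypothetical.

```python
def jaccard_from_sizes(size_a: int, size_b: int, size_shared: int) -> float:
    """Jaccard similarity from set sizes, using |A ∪ B| = |A| + |B| - |A ∩ B|."""
    union = size_a + size_b - size_shared
    return size_shared / union


# Initial sets: 100 and 80 words with 30 in common.
before = jaccard_from_sizes(100, 80, 30)

# Adding the same 20 new words to both sets grows each set and the intersection by 20.
after = jaccard_from_sizes(100 + 20, 80 + 20, 30 + 20)

print(before, round(after, 4))  # 0.2 0.2941
```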
