Jaccard similarity in NLP - Model Metrics & Evaluation

Metrics & Evaluation - Jaccard similarity
What Jaccard similarity measures and why it matters

Jaccard similarity measures how much two sets overlap: the number of items the sets share divided by the number of unique items across both. It suits text comparison, where each document is reduced to a set of words, because the score is the fraction of the combined vocabulary that both documents use.

Confusion matrix or equivalent visualization

Jaccard similarity is not based on a confusion matrix but on set operations. Imagine two sets A and B:

    A = {apple, banana, cherry}
    B = {banana, cherry, date, fig}

    Intersection (A ∩ B) = {banana, cherry} (2 items)
    Union (A ∪ B) = {apple, banana, cherry, date, fig} (5 items)

    Jaccard similarity = |A ∩ B| / |A ∪ B| = 2 / 5 = 0.4
    

This means 40% of the unique items across both sets appear in both A and B.
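
The worked example above can be reproduced with Python's built-in set type. This is a minimal sketch: the function name `jaccard_similarity` is an illustrative choice, and returning 1.0 for two empty sets is an assumed convention that the lesson itself does not define.

```python
def jaccard_similarity(a, b):
    """Return |a ∩ b| / |a ∪ b| for two sets."""
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


A = {"apple", "banana", "cherry"}
B = {"banana", "cherry", "date", "fig"}

score = jaccard_similarity(A, B)
print(score)  # 0.4
```

The empty-set guard avoids a ZeroDivisionError; other conventions (such as returning 0.0) are equally defensible depending on the application.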

Precision vs Recall tradeoff with examples

Unlike precision or recall alone, Jaccard similarity accounts for the extra items in both sets. For example:

  • A precision-like ratio, |A ∩ B| / |A|, asks what fraction of A also appears in B. It ignores how many extra items B contains.
  • A recall-like ratio, |A ∩ B| / |B|, asks what fraction of B is covered by A. It ignores how many extra items A contains.

Jaccard similarity divides the shared items by all unique items in either set, so extra items on either side lower the score. It is also symmetric: swapping A and B gives the same value.
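
To make the contrast concrete, the sketch below computes all three ratios for one pair of sets. The helper name `overlap_scores` and the sample sets are hypothetical, and the function assumes both sets are non-empty.

```python
def overlap_scores(a, b):
    """Compare one-sided overlap ratios with Jaccard similarity.

    Assumes both sets are non-empty.
    """
    shared = len(a & b)
    return {
        "precision_like": shared / len(a),  # fraction of a found in b
        "recall_like": shared / len(b),     # fraction of b covered by a
        "jaccard": shared / len(a | b),     # shared / all unique items
    }


a = {"cat", "dog"}
b = {"dog", "bird", "fish", "frog"}

scores = overlap_scores(a, b)
print(scores)
# {'precision_like': 0.5, 'recall_like': 0.25, 'jaccard': 0.2}
```

The two one-sided ratios disagree because the sets differ in size, while the Jaccard score is penalized by the unshared items on both sides.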

What "good" vs "bad" Jaccard similarity values look like

Jaccard similarity ranges from 0 to 1:

  • Good (close to 1): Sets share most items. For example, 0.8 means 80% overlap, showing strong similarity.
  • Bad (close to 0): Sets share very few or no items. For example, 0.1 means only 10% overlap, showing weak similarity.

In text comparison, a high Jaccard score means the texts use many of the same words.

Common pitfalls when using Jaccard similarity
  • Ignoring set size differences: With small sets, one or two shared items can produce a high score by chance.
  • Not preprocessing text: Different word forms ("run" vs "running") or letter cases ("Apple" vs "apple") count as different items and lower the score.
  • Overlooking stop words: Common words like "the" or "and" can inflate similarity.
  • Using Jaccard for ordered data: It ignores order, so it may miss important differences.
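
The preprocessing and stop-word pitfalls can be illustrated with a short sketch. The stop-word list here is a tiny hand-picked assumption, the regular-expression tokenizer is deliberately simple, and the helper names `word_set` and `jaccard` are hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "on", "is"}


def word_set(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words - STOP_WORDS


def jaccard(a, b):
    union = a | b
    # Assumed convention: two empty sets count as identical.
    return len(a & b) / len(union) if union else 1.0


doc1 = "The cat sat on the mat"
doc2 = "The dog sat on the rug"

# Naive whitespace split: "The", "the", and "on" all count as shared items.
raw = jaccard(set(doc1.split()), set(doc2.split()))
# After lowercasing and stop-word removal, only "sat" is shared.
cleaned = jaccard(word_set(doc1), word_set(doc2))

print(raw)      # 0.5
print(cleaned)  # 0.2
```

In this example the stop words account for most of the raw overlap, so the unprocessed score overstates how similar the two sentences are.
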
Self-check question

Your model compares two documents and gets a Jaccard similarity of 0.98. Is this always good?

Not necessarily. If the documents are very short or contain many common stop words, the high score might be misleading. Always check the context and preprocess the text well.

Key Result
Jaccard similarity measures overlap between sets as shared items divided by total unique items, ranging from 0 (no overlap) to 1 (identical).

Practice

1. What does the Jaccard similarity measure between two sets?
easy
A. The difference between the sizes of the two sets
B. The size of the union divided by the size of the intersection
C. The sum of the sizes of the two sets
D. The size of the intersection divided by the size of the union

Solution

  1. Step 1: Understand the definition of Jaccard similarity

    Jaccard similarity is defined as the size of the intersection of two sets divided by the size of their union.
  2. Step 2: Compare options with the definition

    The size of the intersection divided by the size of the union matches the definition exactly, while others describe different calculations.
  3. Final Answer:

    The size of the intersection divided by the size of the union -> Option D
  4. Quick Check:

    Jaccard similarity = intersection / union [OK]
Hint: Overlap divided by total unique items
Common Mistakes:
  • Confusing union with intersection
  • Using subtraction instead of division
  • Mixing up numerator and denominator
2. Which of the following Python code snippets correctly calculates the Jaccard similarity between two sets A and B?
easy
A. len(A | B) / len(A & B)
B. len(A & B) / len(A | B)
C. len(A - B) / len(B - A)
D. len(A) + len(B)

Solution

  1. Step 1: Identify set operations for intersection and union

    In Python, & is intersection and | is union for sets.
  2. Step 2: Check the formula for Jaccard similarity

    Jaccard similarity = size of intersection / size of union, which matches len(A & B) / len(A | B).
  3. Final Answer:

    len(A & B) / len(A | B) -> Option B
  4. Quick Check:

    Intersection & union operators used correctly [OK]
Hint: Use & for intersection, | for union in Python sets
Common Mistakes:
  • Swapping intersection and union operators
  • Using subtraction instead of intersection
  • Adding lengths instead of dividing
3. Given two sets A = {'apple', 'banana', 'cherry'} and B = {'banana', 'cherry', 'date', 'fig'}, what is the Jaccard similarity computed by this code?
len(A & B) / len(A | B)
medium
A. 0.4
B. 0.5
C. 0.6
D. 0.75

Solution

  1. Step 1: Calculate intersection and union of sets A and B

    Intersection: {'banana', 'cherry'} has 2 elements.
    Union: {'apple', 'banana', 'cherry', 'date', 'fig'} has 5 elements.
  2. Step 2: Compute Jaccard similarity

    Similarity = 2 / 5 = 0.4.
  3. Final Answer:

    0.4 -> Option A
  4. Quick Check:

    2 / 5 = 0.4 [OK]
Hint: Count common and total unique items, then divide
Common Mistakes:
  • Counting union incorrectly
  • Using addition instead of division
  • Mixing up intersection and union counts
4. The following code is intended to compute the Jaccard similarity between two sets A and B. What is the error?
def jaccard(A, B):
    return len(A & B) / len(A & B | B)
medium
A. Function missing return statement
B. Division by zero error possible
C. Incorrect use of union and intersection operators in denominator
D. Sets A and B are not defined

Solution

  1. Step 1: Analyze the denominator expression

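    The denominator is len(A & B | B). In Python, & binds more tightly than |, so A & B | B is parsed as (A & B) | B. Because A & B is a subset of B, that union is just B, so the denominator is len(B) instead of len(A | B).
  2. Step 2: Correct denominator for union

    The denominator should be len(A | B). As written, the function returns |A ∩ B| / |B|, which overstates the similarity whenever A contains items that are not in B. Option B describes a real edge case (an empty B makes the denominator zero), but the denominator expression is the defect that produces wrong results on ordinary inputs.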
  3. Final Answer:

    Incorrect use of union and intersection operators in denominator -> Option C
  4. Quick Check:

    Union must be A | B, not combined with & [OK]
Hint: The denominator should be the plain union, A | B
Common Mistakes:
  • Confusing operator precedence
  • Using intersection inside union calculation
  • Not testing code before use
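
The precedence issue in this question can be checked directly. The sketch below reuses the sets from question 3; the corrected function and its empty-set convention are assumptions added for illustration.

```python
A = {"apple", "banana", "cherry"}
B = {"banana", "cherry", "date", "fig"}

# & binds more tightly than |, so this is (A & B) | B, which equals B.
buggy_denominator = A & B | B
print(buggy_denominator == B)  # True


def jaccard(a, b):
    """Corrected version: divide by the plain union."""
    union = a | b
    if not union:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(union)


buggy = len(A & B) / len(A & B | B)  # 2 / 4
fixed = jaccard(A, B)                # 2 / 5

print(buggy)  # 0.5
print(fixed)  # 0.4
```

The buggy version overstates the score here because it ignores "apple", the one item that appears only in A.
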
5. You want to compare two documents by their unique words using Jaccard similarity. Document 1 has 100 unique words, Document 2 has 80 unique words, and they share 30 of them. What is the Jaccard similarity? If you then add the same 20 new words to both documents (words neither contained before), how does the similarity change?
hard
A. Initial similarity 0.2; after adding common words similarity increases to 0.3
B. Initial similarity 0.15; after adding common words similarity decreases
C. Initial similarity 0.25; after adding common words similarity stays the same
D. Initial similarity 0.18; after adding common words similarity increases to 0.33

Solution

  1. Step 1: Calculate initial Jaccard similarity

    Intersection = 30
    Union = 100 + 80 - 30 = 150
    Similarity = 30 / 150 = 0.2
  2. Step 2: Calculate similarity after adding 20 common words

    New intersection = 30 + 20 = 50
    New union = (100 + 20) + (80 + 20) - 50 = 170
    Similarity = 50 / 170 ≈ 0.294, which rounds to 0.3
  3. Final Answer:

    Initial similarity 0.2; after adding common words similarity increases to 0.3 -> Option A
  4. Quick Check:

    Adding common words increases intersection and similarity [OK]
Hint: Adding shared items increases the numerator and the denominator by the same amount, which raises the ratio whenever it is below 1
Common Mistakes:
  • Forgetting to subtract intersection in union
  • Not updating intersection after adding words
  • Assuming similarity stays constant
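
The arithmetic in this solution can be verified from the counts alone, using |A ∪ B| = |A| + |B| - |A ∩ B|. The helper name `jaccard_from_counts` is hypothetical, and the sketch assumes the union is non-empty.

```python
def jaccard_from_counts(size_a, size_b, shared):
    """Jaccard similarity from set sizes, using inclusion-exclusion for the union."""
    union = size_a + size_b - shared
    return shared / union


# Original documents: 100 and 80 unique words, 30 shared.
before = jaccard_from_counts(100, 80, 30)

# The same 20 new words are added to both documents,
# so each set grows by 20 and the shared count grows by 20.
after = jaccard_from_counts(100 + 20, 80 + 20, 30 + 20)

print(before)           # 0.2
print(round(after, 4))  # 0.2941
```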