Jaccard similarity in NLP - ML Experiment: Train & Evaluate

Experiment - Jaccard similarity
Problem: Calculate the similarity between two text documents using Jaccard similarity. The current method compares raw token sets without removing stopwords or normalizing case, which produces lower similarity scores than expected.
Current Metrics: Jaccard similarity score between two example texts: 0.35
Issue: The score is low because stopwords that appear in only one text, and case variants of the same word (such as "The" and "the"), are counted as distinct tokens. This inflates the union and shrinks the measured overlap.
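The baseline code is not shown in the source, so the following is only a minimal sketch of what a raw token-set comparison might look like. The function name `raw_jaccard` and the toy sentences are assumptions made for illustration.

```python
def raw_jaccard(text1: str, text2: str) -> float:
    """Jaccard similarity on raw whitespace tokens, with no preprocessing."""
    tokens1 = set(text1.split())
    tokens2 = set(text2.split())
    union = tokens1 | tokens2
    # Two empty texts have no tokens at all; return 0.0 instead of dividing by zero
    if not union:
        return 0.0
    return len(tokens1 & tokens2) / len(union)

# "The" and "the" are treated as different tokens, so two sentences that
# differ only in case share 2 of 4 distinct tokens.
print(raw_jaccard("The cat sat", "the cat sat"))  # 0.5
```

Lowercasing both inputs first would make these two sentences identical as token sets, which is the effect the task asks you to exploit.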
Your Task
Improve the Jaccard similarity score by preprocessing the text to remove stopwords and normalize case, aiming to increase the similarity score by at least 0.15.
Do not change the basic Jaccard similarity formula.
Only modify text preprocessing steps.
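For reference, the formula that must stay fixed is the standard Jaccard index over two token sets. The symbols A and B (the token sets of the two texts) are chosen for this note and do not appear in the source. The zero value for two empty sets is a convention, matching the division-by-zero guard in the solution code.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad J(A, B) = 0 \ \text{ if } A \cup B = \emptyset
```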
Solution
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def jaccard_similarity(text1: str, text2: str) -> float:
    # Convert to lowercase
    text1 = text1.lower()
    text2 = text2.lower()

    # Tokenize by splitting on whitespace
    tokens1 = set(text1.split())
    tokens2 = set(text2.split())

    # Remove stopwords
    tokens1 = tokens1 - ENGLISH_STOP_WORDS
    tokens2 = tokens2 - ENGLISH_STOP_WORDS

    # Calculate intersection and union
    intersection = tokens1.intersection(tokens2)
    union = tokens1.union(tokens2)

    # Avoid division by zero
    if not union:
        return 0.0

    return len(intersection) / len(union)

# Example texts
text_a = "The quick brown fox jumps over the lazy dog"
text_b = "A quick brown dog outpaces a lazy fox"

score = jaccard_similarity(text_a, text_b)
print(f"Improved Jaccard similarity score: {score:.2f}")
Converted all text to lowercase to normalize case differences.
Removed common English stopwords from token sets to focus on meaningful words.
Kept the original Jaccard similarity formula unchanged.
Results Interpretation

Before preprocessing, the Jaccard similarity score was 0.35.

After converting text to lowercase and removing stopwords, the score improved to 0.71, an increase of 0.36 that comfortably exceeds the 0.15 target.

Preprocessing text by normalizing case and removing common stopwords helps the Jaccard similarity better capture meaningful overlap between documents.
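The 0.71 figure can be checked by hand for the two example texts in the solution. This worked calculation assumes that "the", "over", and "a" are in scikit-learn's `ENGLISH_STOP_WORDS` and that none of the remaining words are. A and B denote the token sets left after lowercasing and stopword removal.

```latex
\begin{aligned}
A &= \{\text{quick}, \text{brown}, \text{fox}, \text{jumps}, \text{lazy}, \text{dog}\} \\
B &= \{\text{quick}, \text{brown}, \text{dog}, \text{outpaces}, \text{lazy}, \text{fox}\} \\
|A \cap B| &= 5, \qquad |A \cup B| = 7 \\
J(A, B) &= \frac{5}{7} \approx 0.71
\end{aligned}
```

Only "jumps" and "outpaces" are unshared, so five of the seven distinct tokens overlap.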
Bonus Experiment
Try using n-grams (like pairs of words) instead of single words for the Jaccard similarity calculation.
💡 Hint
Create sets of word pairs (bigrams) from the texts before computing intersection and union.
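One possible way to approach the bonus experiment is sketched below. The helper names `word_bigrams` and `bigram_jaccard` are hypothetical, and the sketch only lowercases the text; whether to also drop stopwords before pairing words is a choice left to you.

```python
def word_bigrams(text: str) -> set:
    """Return the set of adjacent word pairs in a lowercased text."""
    tokens = text.lower().split()
    # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
    return set(zip(tokens, tokens[1:]))

def bigram_jaccard(text1: str, text2: str) -> float:
    """Jaccard similarity computed over word bigrams instead of single words."""
    grams1 = word_bigrams(text1)
    grams2 = word_bigrams(text2)
    union = grams1 | grams2
    # Texts with fewer than two words produce no bigrams; avoid dividing by zero
    if not union:
        return 0.0
    return len(grams1 & grams2) / len(union)

text_a = "The quick brown fox jumps over the lazy dog"
text_b = "A quick brown dog outpaces a lazy fox"

score = bigram_jaccard(text_a, text_b)
print(f"Bigram Jaccard similarity score: {score:.2f}")  # 0.07
```

For these two texts the only shared pair is ("quick", "brown"), out of 14 distinct pairs. Bigrams are sensitive to word order, so texts that reuse the same words in a different arrangement score much lower than they do with single-word tokens.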