Similarity measures help us find text pieces that talk about the same or similar things. They make it easy to group or compare texts without reading everything.
Why similarity measures find related text in NLP
Introduction
Finding articles that talk about the same news topic.
Recommending similar product reviews to a shopper.
Grouping customer feedback with similar opinions.
Detecting duplicate questions in a forum.
Matching job descriptions with candidate resumes.
Syntax
similarity_score = similarity_measure(text1_vector, text2_vector)
Text must be converted into numbers (vectors) before measuring similarity.
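Common measures include cosine similarity, Jaccard similarity, and Euclidean distance. Cosine and Jaccard are similarities (a higher score means the texts are more alike), while Euclidean distance is a distance (a lower value means the texts are more alike).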
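As a rough illustration of the two notes above, the sketch below turns two texts into word-count vectors over a shared vocabulary and then compares them with cosine similarity and Euclidean distance. The helper names (to_count_vector, cosine, euclidean) are made up for this example and are not part of any library. In practice a vectorizer such as TfidfVectorizer would usually do the first step.

```python
import math
from collections import Counter


def to_count_vector(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean(a, b):
    """Straight-line distance between two vectors (0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


text1 = 'the cat sat'
text2 = 'the cat ran'

# Shared vocabulary: ['cat', 'ran', 'sat', 'the']
vocabulary = sorted(set(text1.split()) | set(text2.split()))

vec1 = to_count_vector(text1, vocabulary)  # [1, 0, 1, 1]
vec2 = to_count_vector(text2, vocabulary)  # [1, 1, 0, 1]

print(cosine(vec1, vec2))     # about 0.67 (higher = more alike)
print(euclidean(vec1, vec2))  # about 1.41 (lower = more alike)
```

Both numbers describe the same pair of texts, but they read in opposite directions, so it matters which measure a threshold or ranking is built on.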
Examples
This example shows how to calculate cosine similarity between two short texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ['I love apples', 'I like apples', 'I hate bananas']

# Turn each text into a TF-IDF vector (one row per text)
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)

# cosine_similarity returns a 1x1 matrix here, so take element [0][0]
score = cosine_similarity(vectors[0], vectors[1])
print(score[0][0])
This example calculates Jaccard similarity based on shared words.
text1 = 'cat dog'
text2 = 'dog mouse'
set1 = set(text1.split())
set2 = set(text2.split())
jaccard = len(set1 & set2) / len(set1 | set2)
print(jaccard)
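The same calculation can be wrapped in a small reusable function. The version below is only a sketch: the name jaccard_similarity is hypothetical, lowercasing the input is an added assumption so that 'Cat' and 'cat' match, and returning 0.0 for two empty texts is one possible convention that avoids dividing by zero.

```python
def jaccard_similarity(text1, text2):
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    set1 = set(text1.lower().split())
    set2 = set(text2.lower().split())
    union = set1 | set2
    if not union:
        # Both texts are empty: nothing to compare (assumed convention)
        return 0.0
    return len(set1 & set2) / len(union)


print(jaccard_similarity('Cat dog', 'dog MOUSE'))  # 1 shared word out of 3
print(jaccard_similarity('cat dog', 'cat dog'))    # identical sets
```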
Sample Model
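This program compares one text with a related text and with an unrelated one. The first two texts share the word 'learning', so they score slightly higher than the first and third, which share only the common word 'is'. The gap is small because TF-IDF matches exact words: 'machine' and 'machines' count as different words.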
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    'Machine learning is fun',
    'I enjoy learning about machines',
    'The sky is blue today'
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)

# Calculate similarity between first and second text
score_0_1 = cosine_similarity(vectors[0], vectors[1])[0][0]

# Calculate similarity between first and third text
score_0_2 = cosine_similarity(vectors[0], vectors[2])[0][0]

print(f'Similarity between text 0 and 1: {score_0_1:.2f}')
print(f'Similarity between text 0 and 2: {score_0_2:.2f}')
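With more than a few texts, comparing pairs one at a time gets tedious. As a sketch, cosine_similarity can also be called once on the whole set of vectors, which returns every pairwise score in a square array. The variable names matrix, scores, and best_match are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    'Machine learning is fun',
    'I enjoy learning about machines',
    'The sky is blue today',
]

vectors = TfidfVectorizer().fit_transform(texts)

# matrix[i][j] is the similarity between text i and text j
matrix = cosine_similarity(vectors)

# Find the text most similar to text 0, ignoring text 0 itself
scores = matrix[0].copy()
scores[0] = -1.0
best_match = int(scores.argmax())

print(matrix.round(2))
print(f'Text most similar to text 0: text {best_match}')
```

The diagonal of the array is 1 because every text is identical to itself, and the array is symmetric because the similarity of i to j equals that of j to i.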
Important Notes
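Jaccard similarity always ranges from 0 (nothing in common) to 1 (identical sets of words). Cosine similarity on TF-IDF or word-count vectors has the same range, because those vectors never contain negative values; in general it can range from -1 to 1. Euclidean distance works the other way around: 0 means identical and there is no upper limit.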
Choosing the right similarity measure depends on your text and task.
Preprocessing text (like lowercasing, removing stopwords) can improve similarity results.
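To make the preprocessing note concrete, the sketch below compares two unrelated sentences with and without stopword removal. The sentences 'Machine learning is fun' and 'The sky is blue today' share only the word 'is', which is enough to give them a non-zero score. Passing stop_words='english' to TfidfVectorizer drops such common words. The helper first_vs_third is a made-up name for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    'Machine learning is fun',
    'I enjoy learning about machines',
    'The sky is blue today',
]


def first_vs_third(vectorizer):
    """Cosine similarity between the first and third text."""
    vectors = vectorizer.fit_transform(texts)
    return cosine_similarity(vectors[0], vectors[2])[0][0]


# Default settings keep 'is', so the unrelated texts still overlap
score_default = first_vs_third(TfidfVectorizer())

# With English stopwords removed, the two texts share no words
score_no_stopwords = first_vs_third(TfidfVectorizer(stop_words='english'))

print(f'Default:           {score_default:.2f}')
print(f'Stopwords removed: {score_no_stopwords:.2f}')
```

Stopword removal is not always an improvement. For very short texts it can strip away most of the words, so it is worth checking the effect on your own data.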
Summary
Similarity measures help find related texts by comparing their numeric forms.
They are useful in many real-life tasks like recommendations and grouping.
Cosine similarity and Jaccard similarity are common and easy to use.