Cosine similarity in NLP - ML Experiment: Train & Evaluate

Experiment - Cosine similarity

Problem: You want to measure how similar two sentences are by computing cosine similarity on their vector representations.

Current Metrics: Cosine similarity scores are computed, but they sometimes fail to reflect true similarity because text preprocessing is missing and the vectors are not built and normalized consistently.

Issue: Scores are inconsistent and sometimes low for clearly similar sentences, because without proper preprocessing and vectorization, differences in case and punctuation keep matching words from lining up in the vectors.
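
To make the issue concrete, here is a minimal sketch that builds word-count vectors with a plain `str.split()` and compares two of the sentences used later in this experiment. The helper names `bow_cosine` and `normalize_text` are hypothetical and exist only for this illustration; the naive whitespace tokenizer is an assumption about how an unpreprocessed pipeline might look, not a description of any particular original code.

```python
import math
import string
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors built with a whitespace split."""
    counts_a, counts_b = Counter(a.split()), Counter(b.split())
    # Only words present in both sentences contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def normalize_text(text: str) -> str:
    """Lowercase and strip ASCII punctuation (illustrative helper)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

s1 = "I love machine learning."
s2 = "Machine learning is my passion!"

raw_score = bow_cosine(s1, s2)
clean_score = bow_cosine(normalize_text(s1), normalize_text(s2))

print(f"Without preprocessing: {raw_score:.3f}")
print(f"With preprocessing:    {clean_score:.3f}")
```

Without preprocessing, "Machine" and "machine" are different tokens, and so are "learning." and "learning", so the two sentences share no tokens at all and the score collapses to zero. After lowercasing and stripping punctuation, the shared words line up and the score becomes positive.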

Your Task

Improve cosine similarity calculation so that similar sentences have scores closer to 1 and dissimilar sentences have scores closer to 0.

Use only basic Python and sklearn libraries.

Do not use deep learning models or external APIs.

Keep the code simple and easy to understand.

Solution

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import string

def preprocess(text):
    text = text.lower()
    return text.translate(str.maketrans('', '', string.punctuation))

sentences = [
    "I love machine learning.",
    "Machine learning is my passion!",
    "The sky is blue.",
    "I enjoy sunny days."
]

# Preprocess sentences
clean_sentences = [preprocess(s) for s in sentences]

# Convert sentences to TF-IDF vectors
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(clean_sentences)

# Compute cosine similarity matrix
cosine_sim_matrix = cosine_similarity(vectors)

# Print similarity between first two sentences (expected high similarity)
print(f"Similarity between sentence 1 and 2: {cosine_sim_matrix[0,1]:.3f}")

# Print similarity between first and third sentence (expected low similarity)
print(f"Similarity between sentence 1 and 3: {cosine_sim_matrix[0,2]:.3f}")

Added text preprocessing to lowercase and remove punctuation.

Used TF-IDF vectorizer to convert text into normalized vectors.

Used sklearn's cosine_similarity function to compute similarity between vectors.
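
The "normalized vectors" point can be checked directly. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings (in particular `norm="l2"`) and uses already-preprocessed copies of the four sentences from the solution. It verifies that every TF-IDF row has unit length, which means a plain dot product between rows gives the same numbers as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Preprocessed versions of the four sentences from the solution.
docs = [
    "i love machine learning",
    "machine learning is my passion",
    "the sky is blue",
    "i enjoy sunny days",
]

# With default settings, each row is scaled to unit L2 length.
tfidf = TfidfVectorizer().fit_transform(docs)

# L2 norm of every row, computed on the sparse matrix.
row_norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1))).ravel()

# For unit-length rows, the dot product already equals the cosine.
dot_products = (tfidf @ tfidf.T).toarray()
cos = cosine_similarity(tfidf)

print("All rows unit length:", np.allclose(row_norms, 1.0))
print("Dot product equals cosine:", np.allclose(dot_products, cos))
```

Cosine similarity divides by the vector lengths anyway, so the L2 normalization does not change the scores; what changes them relative to raw counts is the TF-IDF weighting of each term.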

Results Interpretation

Before: Cosine similarity scores were inconsistent and sometimes low for similar sentences.
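After: With this four-sentence corpus, the similar pair (sentences 1 and 2) scores about 0.42, while the pair with no shared words (sentences 1 and 3) scores 0. Short sentences that share only some of their words rarely score close to 1, so judge the result by the gap between similar and dissimilar pairs rather than by a fixed threshold.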

Proper text preprocessing and vector normalization are essential for meaningful cosine similarity results in NLP tasks.

Bonus Experiment

Try a plain CountVectorizer instead of TF-IDF and compare the cosine similarity scores.

💡 Hint

Replace TfidfVectorizer with CountVectorizer and observe how similarity values change.
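
One possible way to run this comparison is sketched below. It reuses preprocessed copies of the four sentences from the solution and assumes default settings for both vectorizers; the `scores` dictionary is just a convenience for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Preprocessed versions of the four sentences from the solution.
docs = [
    "i love machine learning",
    "machine learning is my passion",
    "the sky is blue",
    "i enjoy sunny days",
]

vectorizers = {
    "count": CountVectorizer(),
    "tfidf": TfidfVectorizer(),
}

# Fit each vectorizer on the same corpus and keep its similarity matrix.
scores = {}
for name, vectorizer in vectorizers.items():
    vectors = vectorizer.fit_transform(docs)
    scores[name] = cosine_similarity(vectors)

for name, matrix in scores.items():
    print(f"{name}: sentence 1 vs 2 = {matrix[0, 1]:.3f}, "
          f"sentence 1 vs 3 = {matrix[0, 2]:.3f}")
```

On this tiny corpus the count-based score for sentences 1 and 2 comes out higher (about 0.52) than the TF-IDF score (about 0.42). TF-IDF gives more weight to words that appear in only one sentence, such as "love" and "passion", than to the shared words "machine" and "learning", which pulls the vectors slightly apart. Both methods give 0 for sentences 1 and 3, which have no words in common.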

Practice

1. What does cosine similarity measure between two vectors? (easy)

A. The difference in vector lengths

B. How close the vectors point in the same direction

C. The sum of vector elements

D. The distance between vector origins
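
To explore this question hands-on, the following sketch implements the standard formula, the dot product divided by the product of the two vector lengths, in plain Python. The function name `cosine` and the example vectors are chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice the length
c = [3.0, -1.5, 0.0]   # perpendicular to a (dot product is 0)

same_direction = cosine(a, b)
perpendicular = cosine(a, c)

print(f"a vs b: {same_direction:.3f}")
print(f"a vs c: {perpendicular:.3f}")
```

Doubling the length of a vector leaves the score unchanged, while rotating it to a right angle drops the score to zero. The measure depends on the angle between the vectors, not on their magnitudes.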
