
Cosine similarity in NLP - ML Experiment: Train & Evaluate

Experiment - Cosine similarity
Problem: You want to measure how similar two sentences are by computing cosine similarity on their vector representations.
Current Metrics: Cosine similarity scores are computed, but they do not always reflect true similarity because the text is not preprocessed and the vectors are not consistently weighted or normalized.
Issue: Scores are inconsistent, and clearly similar sentences sometimes score low, because the text is not vectorized and normalized properly.
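For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python; the helper name `cosine` and the two toy word-count vectors are assumptions made for illustration and are not part of the original experiment.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical word counts over the vocabulary [love, machine, learning, passion]
a = [1, 1, 1, 0]
b = [0, 1, 1, 1]

score = cosine(a, b)
print(f"{score:.3f}")  # two shared words out of three in each vector
```

Because the result is divided by both vector lengths, multiplying either vector by a positive constant leaves the score unchanged. What moves the score is which tokens are counted and how they are weighted.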
Your Task
Improve cosine similarity calculation so that similar sentences have scores closer to 1 and dissimilar sentences have scores closer to 0.
Use only basic Python and sklearn libraries.
Do not use deep learning models or external APIs.
Keep the code simple and easy to understand.
Solution
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import string

def preprocess(text):
    text = text.lower()
    return text.translate(str.maketrans('', '', string.punctuation))

sentences = [
    "I love machine learning.",
    "Machine learning is my passion!",
    "The sky is blue.",
    "I enjoy sunny days."
]

# Preprocess sentences
clean_sentences = [preprocess(s) for s in sentences]

# Convert sentences to TF-IDF vectors
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(clean_sentences)

# Compute cosine similarity matrix
cosine_sim_matrix = cosine_similarity(vectors)

# Print similarity between first two sentences (expected to be clearly higher)
print(f"Similarity between sentence 1 and 2: {cosine_sim_matrix[0,1]:.3f}")

# Print similarity between first and third sentence (expected low similarity)
print(f"Similarity between sentence 1 and 3: {cosine_sim_matrix[0,2]:.3f}")
Added text preprocessing to lowercase the text and remove punctuation.
Used TfidfVectorizer to convert the text into TF-IDF vectors, which are L2-normalized by default.
Used sklearn's cosine_similarity function to compute the similarity between every pair of vectors.
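One way to check the normalization step is to look at the row lengths directly. The sketch below assumes the default `TfidfVectorizer` settings (`norm="l2"`) and uses the already-cleaned sentences from the solution. It also compares `cosine_similarity` with `linear_kernel`, a plain dot product, which should agree when every row has length 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

sentences = [
    "i love machine learning",
    "machine learning is my passion",
    "the sky is blue",
    "i enjoy sunny days",
]

vectors = TfidfVectorizer().fit_transform(sentences)

# With norm="l2" (the default), each row should have Euclidean length 1.
row_norms = np.linalg.norm(vectors.toarray(), axis=1)
print(row_norms)

# For unit-length rows, the dot product and the cosine similarity coincide.
dot_scores = linear_kernel(vectors)
cos_scores = cosine_similarity(vectors)
print(np.allclose(dot_scores, cos_scores))
```

Converting to a dense array with `toarray()` is fine for four sentences; for a large corpus it would be better to stay with the sparse matrix.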
Results Interpretation

Before: Cosine similarity scores were inconsistent and sometimes low for similar sentences.
After: The similar pair (sentences 1 and 2) scores roughly 0.4 with the default TfidfVectorizer settings, clearly above the dissimilar pair (sentences 1 and 3), which shares no terms and scores 0.

Proper text preprocessing and vector normalization are essential for meaningful cosine similarity results in NLP tasks.
Bonus Experiment
Try using a simple count vectorizer instead of TF-IDF and compare the cosine similarity scores.
💡 Hint
Replace TfidfVectorizer with CountVectorizer and observe how similarity values change.
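A minimal sketch of that comparison follows. The helper name `pairwise_scores` is an assumption for illustration. The raw sentences are passed in unchanged because both vectorizers lowercase the text and split on punctuation by default.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "I love machine learning.",
    "Machine learning is my passion!",
    "The sky is blue.",
    "I enjoy sunny days.",
]

def pairwise_scores(vectorizer_cls):
    """Fit the given vectorizer class with default settings and return the similarity matrix."""
    vectors = vectorizer_cls().fit_transform(sentences)
    return cosine_similarity(vectors)

count_sim = pairwise_scores(CountVectorizer)
tfidf_sim = pairwise_scores(TfidfVectorizer)

for name, sim in [("CountVectorizer", count_sim), ("TfidfVectorizer", tfidf_sim)]:
    print(f"{name}: s1 vs s2 = {sim[0, 1]:.3f}, s1 vs s3 = {sim[0, 2]:.3f}")
```

On this tiny corpus the count-based score for sentences 1 and 2 should come out somewhat higher than the TF-IDF score (about 0.52 versus about 0.42). TF-IDF gives less weight to "machine" and "learning" because they appear in two of the four sentences. Sentences 1 and 3 share no tokens under either scheme, so both score 0.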