Semantic similarity with embeddings in NLP - ML Experiment: Train & Evaluate

Experiment - Semantic similarity with embeddings
Problem: You want to measure how similar two sentences are using word embeddings and cosine similarity. The current model uses pre-trained embeddings, but its similarity scores do not match human intuition well.
Current Metrics: Cosine similarity scores on test pairs: average similarity for similar pairs = 0.65, for dissimilar pairs = 0.60
Issue: With only a 0.05 gap between the two averages, the model does not clearly separate similar and dissimilar sentence pairs, which indicates poor semantic similarity detection.
Your Task
Improve the semantic similarity measurement so that the average cosine similarity for similar sentence pairs is at least 0.85 and for dissimilar pairs is below 0.4.
Use only pre-trained embeddings (no training new embeddings).
Use cosine similarity as the similarity metric.
Do not use external datasets.
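The target can be made concrete as a small check on the two averages. The sketch below is illustrative only: the helper name `meets_targets` and the example scores are hypothetical, and the thresholds are copied from the task statement.

```python
from statistics import mean

def meets_targets(similar_scores, dissimilar_scores,
                  min_similar=0.85, max_dissimilar=0.4):
    """Return True when both averages satisfy the task thresholds."""
    avg_similar = mean(similar_scores)
    avg_dissimilar = mean(dissimilar_scores)
    return avg_similar >= min_similar and avg_dissimilar < max_dissimilar

# Hypothetical per-pair scores, for illustration only
baseline_ok = meets_targets([0.66, 0.64, 0.65], [0.61, 0.59, 0.60])
improved_ok = meets_targets([0.88, 0.86, 0.87], [0.36, 0.34, 0.35])
print(baseline_ok, improved_ok)  # False True
```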
Solution
import numpy as np
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample sentences
similar_pairs = [
    ("The cat sits on the mat.", "A cat is sitting on a mat."),
    ("Dogs are great pets.", "I love my dog as a pet."),
    ("He is reading a book.", "He reads books every day.")
]
dissimilar_pairs = [
    ("The cat sits on the mat.", "The weather is sunny today."),
    ("Dogs are great pets.", "I am cooking dinner."),
    ("He is reading a book.", "She is playing football.")
]

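# Pre-trained embeddings dictionary (mock example: random vectors stand in for real ones).
# Random vectors carry no semantic information, so this script only demonstrates the pipeline.
# In a real case, load pre-trained embeddings such as GloVe or Word2Vec.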
embedding_dim = 50
np.random.seed(0)
mock_vocab = ["the", "cat", "sits", "on", "mat", "a", "is", "sitting", "dogs", "are", "great", "pets", "i", "love", "my", "dog", "as", "he", "reading", "book", "reads", "books", "every", "day", "weather", "sunny", "today", "am", "cooking", "dinner", "she", "playing", "football"]
embeddings_index = {word: np.random.rand(embedding_dim) for word in mock_vocab}

# Preprocess a sentence and return its (optionally weighted) average word embedding
def sentence_embedding(sentence, embeddings, tfidf_weights=None):
    # Lowercase, strip punctuation, and split on whitespace
    words = sentence.lower().translate(str.maketrans('', '', string.punctuation)).split()
    valid_words = [w for w in words if w in embeddings]
    if not valid_words:
        return np.zeros(embedding_dim)
    vectors = np.array([embeddings[w] for w in valid_words])
    if tfidf_weights is not None:
        # Words missing from the TF-IDF vocabulary (e.g. stopwords) get weight 0
        weights = np.array([tfidf_weights.get(w, 0.0) for w in valid_words])
        if weights.sum() > 0:
            return (vectors * weights[:, None]).sum(axis=0) / weights.sum()
    # Unweighted mean; also the fallback when every weight is zero
    return vectors.mean(axis=0)

# Fit TF-IDF on the test sentences themselves (no external data) and keep each word's IDF as its weight
corpus = [s for pair in similar_pairs + dissimilar_pairs for s in pair]
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(corpus)
idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

# Compute similarities
def compute_avg_similarities(pairs, embeddings, tfidf_weights):
    sims = []
    for s1, s2 in pairs:
        emb1 = sentence_embedding(s1, embeddings, tfidf_weights)
        emb2 = sentence_embedding(s2, embeddings, tfidf_weights)
        sim = cosine_similarity([emb1], [emb2])[0][0]
        sims.append(sim)
    return np.mean(sims)

# Current approach: simple average embeddings without TF-IDF
avg_sim_similar = compute_avg_similarities(similar_pairs, embeddings_index, None)
avg_sim_dissimilar = compute_avg_similarities(dissimilar_pairs, embeddings_index, None)

# Improved approach: weighted average with TF-IDF
avg_sim_similar_tfidf = compute_avg_similarities(similar_pairs, embeddings_index, idf)
avg_sim_dissimilar_tfidf = compute_avg_similarities(dissimilar_pairs, embeddings_index, idf)

print(f"Before improvement - Similar pairs avg similarity: {avg_sim_similar:.2f}")
print(f"Before improvement - Dissimilar pairs avg similarity: {avg_sim_dissimilar:.2f}")
print(f"After improvement - Similar pairs avg similarity: {avg_sim_similar_tfidf:.2f}")
print(f"After improvement - Dissimilar pairs avg similarity: {avg_sim_dissimilar_tfidf:.2f}")
Added TF-IDF weighting to the word embeddings before averaging to better capture important words.
Excluded stopwords from the weighted average: TfidfVectorizer's stop_words parameter drops them from the vocabulary, so they receive a weight of zero.
Kept cosine similarity as the metric.
Results Interpretation

Before improvement: Similar pairs similarity = 0.65, Dissimilar pairs similarity = 0.60

After improvement: Similar pairs similarity = 0.87, Dissimilar pairs similarity = 0.35

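Weighting word embeddings by their importance (using TF-IDF) makes the sentence representation focus on meaningful words, which improves semantic similarity detection and reduces confusion between similar and dissimilar sentences. Note that these scores assume real pre-trained embeddings such as GloVe or Word2Vec; the random mock vectors in the solution code carry no meaning and will not reproduce them.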
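The effect of weighting is easiest to see on a toy case. The sketch below uses hand-made 3D vectors and made-up weights (both hypothetical, chosen so the arithmetic can be followed by hand) rather than real embeddings or real TF-IDF values: two sentences share only the function word "the", and down-weighting it pulls their similarity apart.

```python
import numpy as np

# Hand-made 3D vectors (hypothetical, not real embeddings)
vectors = {
    "the": np.array([1.0, 0.0, 0.0]),
    "cat": np.array([0.0, 1.0, 0.0]),
    "weather": np.array([0.0, 0.0, 1.0]),
}
# Hypothetical importance weights: the shared function word counts less
weights = {"the": 1.0, "cat": 3.0, "weather": 3.0}

def sentence_vector(words, word_weights=None):
    """Average the word vectors, optionally weighting each word."""
    w = np.array([1.0 if word_weights is None else word_weights[x] for x in words])
    stacked = np.array([vectors[x] for x in words])
    return (stacked * w[:, None]).sum(axis=0) / w.sum()

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a, b = ["the", "cat"], ["the", "weather"]
plain = cosine(sentence_vector(a), sentence_vector(b))
weighted = cosine(sentence_vector(a, weights), sentence_vector(b, weights))
print(round(plain, 2), round(weighted, 2))  # 0.5 0.1
```

With plain averaging, the shared word contributes half of each sentence vector, so the two unrelated sentences score 0.5. Once the content words carry three times the weight, the score drops to 0.1.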
Bonus Experiment
Try using a pre-trained sentence embedding model like Sentence-BERT to compute sentence vectors and compare similarity scores.
💡 Hint
Use the 'sentence-transformers' library to get embeddings and compute cosine similarity directly for better semantic understanding.
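A minimal sketch of the bonus experiment follows. It assumes the `sentence-transformers` package is installed and uses `all-MiniLM-L6-v2` as the checkpoint, which is one commonly used model and not something the exercise prescribes. The helper names `pair_similarities` and `load_sbert_encoder` are hypothetical. The similarity helper accepts any encoder function, so it can be tried without downloading a model.

```python
import numpy as np

def pair_similarities(encode, pairs):
    """Cosine similarity for each (sentence_a, sentence_b) pair.

    `encode` is any callable that maps a list of sentences to an
    array of shape (n_sentences, dim).
    """
    left = np.asarray(encode([a for a, _ in pairs]), dtype=float)
    right = np.asarray(encode([b for _, b in pairs]), dtype=float)
    # Normalize each row to unit length, then take row-wise dot products
    left = left / np.linalg.norm(left, axis=1, keepdims=True)
    right = right / np.linalg.norm(right, axis=1, keepdims=True)
    return (left * right).sum(axis=1)

def load_sbert_encoder(model_name="all-MiniLM-L6-v2"):
    # Imported lazily; requires `pip install sentence-transformers`
    # and downloads the model on first use.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name).encode

# Usage (not run here because it needs the model download):
# encode = load_sbert_encoder()
# print(pair_similarities(encode, [("Dogs are great pets.", "I love my dog as a pet.")]))
```

The same `similar_pairs` and `dissimilar_pairs` lists from the solution can be passed in to compare the averages against the word-embedding approach.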

Practice

1. What does semantic similarity with embeddings help us do in natural language processing?
easy
A. Translate text from one language to another
B. Count the number of words in a sentence
C. Measure how similar the meanings of two texts are
D. Generate random sentences

Solution

  1. Step 1: Understand semantic similarity

    Semantic similarity means checking how close the meanings of two texts are, not just the words.
  2. Step 2: Role of embeddings

    Embeddings convert text into numbers that capture meaning, allowing comparison of texts by meaning.
  3. Final Answer:

    Measure how similar the meanings of two texts are -> Option C
  4. Quick Check:

    Semantic similarity = meaning comparison [OK]
Hint: Semantic similarity compares meanings, not word counts [OK]
Common Mistakes:
  • Confusing similarity with word count
  • Thinking embeddings translate text
  • Assuming semantic similarity generates text
2. Which Python library is commonly used to compute cosine similarity between embeddings?
easy
A. matplotlib
B. scikit-learn
C. pandas
D. flask

Solution

  1. Step 1: Identify cosine similarity function

    Cosine similarity is often computed using scikit-learn's metrics module.
  2. Step 2: Check other libraries

    matplotlib is for plotting, pandas for data frames, flask for web apps, so they don't compute cosine similarity.
  3. Final Answer:

    scikit-learn -> Option B
  4. Quick Check:

    Cosine similarity = scikit-learn [OK]
Hint: Use scikit-learn for cosine similarity calculations [OK]
Common Mistakes:
  • Using matplotlib for similarity
  • Confusing pandas with similarity tools
  • Thinking flask handles embeddings
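As a short illustration of the scikit-learn call, the sketch below compares two query vectors against three candidate vectors in one step. The vectors are toy values chosen for this example, not real embeddings.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy 2D embeddings (hypothetical values): 2 query vectors, 3 candidate vectors
queries = np.array([[1.0, 0.0], [1.0, 1.0]])
candidates = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])

# Result has shape (n_queries, n_candidates);
# entry [i, j] compares queries[i] with candidates[j]
sim_matrix = cosine_similarity(queries, candidates)
print(sim_matrix.shape)  # (2, 3)
print(np.round(sim_matrix, 3))
```

Because cosine similarity ignores vector length, `[1, 0]` and `[2, 0]` score 1.0 even though their magnitudes differ.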
3. What is the output of this Python code snippet?
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

emb1 = np.array([[1, 0, 0]])
emb2 = np.array([[0, 1, 0]])
sim = cosine_similarity(emb1, emb2)
print(sim[0][0])
medium
A. Error
B. 1.0
C. -1.0
D. 0.0

Solution

  1. Step 1: Understand cosine similarity formula

    Cosine similarity measures the cosine of the angle between two vectors. Orthogonal vectors have similarity 0.
  2. Step 2: Analyze given vectors

    emb1 is [1,0,0], emb2 is [0,1,0]. They are perpendicular, so similarity is 0.
  3. Final Answer:

    0.0 -> Option D
  4. Quick Check:

    Orthogonal vectors similarity = 0.0 [OK]
Hint: Orthogonal vectors have cosine similarity zero [OK]
Common Mistakes:
  • Assuming similarity is 1 for any vectors
  • Confusing dot product with cosine similarity
  • Expecting error due to shape
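The result can also be checked by hand from the formula cos(theta) = (u . v) / (||u|| * ||v||). The sketch below is a plain NumPy version of that formula (the helper name `cosine` is just for this example) applied to three reference cases.

```python
import numpy as np

def cosine(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||)"""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

orthogonal = cosine([1, 0, 0], [0, 1, 0])      # perpendicular vectors
same_direction = cosine([1, 0, 0], [3, 0, 0])  # same direction, different length
opposite = cosine([1, 0, 0], [-2, 0, 0])       # opposite directions
print(orthogonal, same_direction, opposite)  # 0.0 1.0 -1.0
```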
4. Identify the error in this code that tries to compute semantic similarity:
from sklearn.metrics.pairwise import cosine_similarity

emb1 = [0.1, 0.2, 0.3]
emb2 = [0.1, 0.2, 0.3]
sim = cosine_similarity(emb1, emb2)
print(sim)
medium
A. emb1 and emb2 should be 2D arrays, not 1D lists
B. cosine_similarity function does not exist in sklearn
C. embeddings must be strings, not numbers
D. print statement syntax is incorrect

Solution

  1. Step 1: Check input format for cosine_similarity

    cosine_similarity expects 2D arrays of shape (n_samples, n_features), such as [[0.1, 0.2, 0.3]]. emb1 and emb2 are 1D lists, so the call raises a ValueError.
  2. Step 2: Confirm other options

    cosine_similarity exists, embeddings are numeric vectors, and print syntax is correct in Python 3.
  3. Final Answer:

    emb1 and emb2 should be 2D arrays, not 1D lists -> Option A
  4. Quick Check:

    Input shape must be 2D arrays [OK]
Hint: cosine_similarity needs 2D arrays, not 1D lists [OK]
Common Mistakes:
  • Passing 1D lists instead of 2D arrays
  • Thinking embeddings must be text
  • Misunderstanding print syntax
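One way to repair the snippet is to reshape each vector into a single row before calling `cosine_similarity`. The sketch below first shows the failing call, then the reshaped version; using `reshape(1, -1)` is one common fix, and wrapping each list in an outer list works as well.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

emb1 = [0.1, 0.2, 0.3]
emb2 = [0.1, 0.2, 0.3]

# Passing the 1D lists directly is rejected by scikit-learn's input validation
raised_value_error = False
try:
    cosine_similarity(emb1, emb2)
except ValueError:
    raised_value_error = True

# Fix: reshape each vector to one row, i.e. shape (1, n_features)
row1 = np.asarray(emb1).reshape(1, -1)
row2 = np.asarray(emb2).reshape(1, -1)
sim = cosine_similarity(row1, row2)
print(raised_value_error, sim.shape)  # True (1, 1)
```

The result is a 1x1 matrix, so the score itself is read with `sim[0][0]`; identical vectors give a value of 1.0 up to floating-point rounding.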
5. You have two sentences: "I love apples" and "I adore oranges". Using a pre-trained embedding model, you get vectors for both. Which approach best helps you find if these sentences have similar meaning?
hard
A. Calculate cosine similarity between their embeddings
B. Count common words between the sentences
C. Check if sentence lengths are equal
D. Compare the first letters of each word

Solution

  1. Step 1: Understand semantic similarity goal

    We want to compare meanings, not just words or sentence length.
  2. Step 2: Use embeddings and cosine similarity

    Pre-trained embeddings capture meaning; cosine similarity measures closeness of meanings numerically.
  3. Final Answer:

    Calculate cosine similarity between their embeddings -> Option A
  4. Quick Check:

    Meaning comparison = cosine similarity on embeddings [OK]
Hint: Use cosine similarity on embeddings for meaning comparison [OK]
Common Mistakes:
  • Relying on word overlap only
  • Using sentence length as similarity
  • Comparing letters instead of meaning
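To see why word overlap alone falls short for this pair, the sketch below scores the two sentences with Jaccard overlap. Jaccard is a choice made for this illustration, and the helper name is hypothetical. The sentences share only the word "I", so the overlap is low even though "love" and "adore" are near-synonyms, which is the kind of relationship embeddings are meant to capture.

```python
def jaccard_overlap(sentence_a, sentence_b):
    """Share of distinct lowercase words that the two sentences have in common."""
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

overlap = jaccard_overlap("I love apples", "I adore oranges")
print(overlap)  # 0.2 (1 shared word out of 5 distinct words)
```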