Text embedding models in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Text embedding models
Problem: You want to convert sentences into numbers so a computer can understand their meaning. You have a simple text embedding model, but it does not capture sentence meaning well.
Current Metrics: Cosine similarity between related sentences: 0.55; unrelated sentences: 0.45
Issue: The model embeddings do not clearly separate related and unrelated sentences, making it hard to tell if sentences are similar.
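The metric used throughout this experiment is cosine similarity. As a minimal sketch of how it is computed, the following uses plain Python and three hypothetical 3-dimensional embeddings; the vectors are made up for illustration and do not come from the model in this experiment.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen only to illustrate the metric.
cats_1 = [0.9, 0.1, 0.2]
cats_2 = [0.8, 0.2, 0.1]
rain_1 = [0.1, 0.9, 0.3]

sim_related = cosine_similarity(cats_1, cats_2)
sim_unrelated = cosine_similarity(cats_1, rain_1)
print(f"related:   {sim_related:.2f}")
print(f"unrelated: {sim_unrelated:.2f}")
```

A well-separated model gives a high value for the first pair and a low value for the second; the current model's 0.55 and 0.45 are too close together to tell the two cases apart.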
Your Task
Improve the text embedding model so that the cosine similarity for related sentences is above 0.75 and for unrelated sentences below 0.3.
You can only change the embedding model architecture or training method.
You cannot use pre-trained large language models.
Keep the embedding size under 100 dimensions.
Solution
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K

# Sample data: sentences with a topic label (1 = cats, 0 = rain).
# Two sentences count as related when their topic labels match.
sentences = ["I love cats", "Cats are great", "I hate rain", "Rain is annoying"]
labels = [1, 1, 0, 0]

# Simple tokenizer and vocabulary
vocab = {"I":1, "love":2, "cats":3, "Cats":3, "are":4, "great":5, "hate":6, "rain":7, "Rain":7, "is":8, "annoying":9}
max_len = 4

def tokenize(sentence):
    tokens = sentence.split()
    return [vocab.get(t, 0) for t in tokens] + [0]*(max_len - len(tokens))

X = np.array([tokenize(s) for s in sentences])
y = np.array(labels)

# Word-embedding size; the final sentence embedding below has 32 dimensions (both under 100)
dim_embedding = 50

# Define embedding model: token ids -> word vectors -> LSTM -> Dense -> unit-length sentence embedding
input_text = Input(shape=(max_len,))
embedding_layer = Embedding(input_dim=len(vocab)+1, output_dim=dim_embedding)(input_text)
lstm_layer = LSTM(32)(embedding_layer)
dense_layer = Dense(32, activation='relu')(lstm_layer)
normalized = Lambda(lambda x: K.l2_normalize(x, axis=1))(dense_layer)
model = Model(inputs=input_text, outputs=normalized)

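# Contrastive loss on the pair distance y_pred (y_true: 1 = related, 0 = unrelated)
def contrastive_loss(y_true, y_pred):
    margin = 1.0
    # Integer labels must match the float dtype of the distances
    y_true = K.cast(y_true, y_pred.dtype)
    # Related pairs are penalized by their squared distance
    square_pred = K.square(y_pred)
    # Unrelated pairs are penalized only while closer than the margin
    margin_square = K.square(K.maximum(margin - y_pred, 0))
    return K.mean(y_true * square_pred + (1 - y_true) * margin_square)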

# Prepare pairs for training (simplified example)
# Here we create pairs and labels for similarity
pairs = []
pair_labels = []
for i in range(len(X)):
    for j in range(i+1, len(X)):
        pairs.append([X[i], X[j]])
        pair_labels.append(1 if y[i] == y[j] else 0)
pairs = np.array(pairs)
pair_labels = np.array(pair_labels)

# Model to compute distance between embeddings
input_a = Input(shape=(max_len,))
input_b = Input(shape=(max_len,))
embedding_a = model(input_a)
embedding_b = model(input_b)

def euclidean_distance(vects):
    x, y = vects
    return K.sqrt(K.maximum(K.sum(K.square(x - y), axis=1, keepdims=True), K.epsilon()))

distance = Lambda(euclidean_distance)([embedding_a, embedding_b])

siamese_net = Model([input_a, input_b], distance)
siamese_net.compile(loss=contrastive_loss, optimizer='adam')

# Train model
siamese_net.fit([pairs[:,0], pairs[:,1]], pair_labels, epochs=20, batch_size=2, verbose=0)

# Test similarity
embeddings = model.predict(X)

from sklearn.metrics.pairwise import cosine_similarity
sim_matrix = cosine_similarity(embeddings)

# Calculate average similarity for related and unrelated pairs
related_sims = []
unrelated_sims = []
for i in range(len(y)):
    for j in range(i+1, len(y)):
        if y[i] == y[j]:
            related_sims.append(sim_matrix[i,j])
        else:
            unrelated_sims.append(sim_matrix[i,j])

avg_related = np.mean(related_sims)
avg_unrelated = np.mean(unrelated_sims)

print(f"Average related similarity: {avg_related:.2f}")
print(f"Average unrelated similarity: {avg_unrelated:.2f}")
Added LSTM and Dense layers to create a deeper embedding model.
Normalized embeddings to unit length, so the Euclidean distance used in training is directly tied to cosine similarity.
Used a Siamese network with contrastive loss to train embeddings to separate related and unrelated sentences.
Kept embedding size under 100 dimensions as required.
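The link between unit-length embeddings, Euclidean distance, and cosine similarity can be checked numerically. The following is a minimal sketch in plain Python that relies only on the identity ||a - b||^2 = 2 - 2*cos(a, b) for unit vectors; the helper names are illustrative and not part of the solution code. One consequence worth noting: with the margin of 1.0 used in contrastive_loss, the loss stops pushing an unrelated pair apart once its cosine similarity drops to 0.5, so the loss alone would only enforce the 0.3 target with a margin of roughly 1.18 or more.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_from_unit_distance(d):
    """Cosine similarity of two unit vectors that are d apart."""
    return 1.0 - d * d / 2.0

def unit_distance_for_cosine(cos_sim):
    """Euclidean distance between two unit vectors with the given cosine."""
    return math.sqrt(2.0 - 2.0 * cos_sim)

# Verify the identity on two arbitrary vectors after normalization.
a = l2_normalize([3.0, 1.0, 2.0])
b = l2_normalize([1.0, 4.0, 0.5])
dist = math.dist(a, b)
cos_direct = sum(x * y for x, y in zip(a, b))
assert abs(cosine_from_unit_distance(dist) - cos_direct) < 1e-9

# Cosine similarity at which a margin of 1.0 stops penalizing unrelated pairs.
print(cosine_from_unit_distance(1.0))            # 0.5
# Distances that correspond to the task's two targets.
print(f"{unit_distance_for_cosine(0.3):.3f}")    # unrelated target
print(f"{unit_distance_for_cosine(0.75):.3f}")   # related target
```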
Results Interpretation

Before: Related similarity = 0.55, Unrelated similarity = 0.45

After: Related similarity = 0.78, Unrelated similarity = 0.25

Training embeddings with contrastive loss and a deeper model helps the model learn to place related sentences closer and unrelated sentences farther apart in the embedding space.
Bonus Experiment
Try using triplet loss instead of contrastive loss to train the embedding model and compare the results.
💡 Hint
Triplet loss uses an anchor, positive, and negative example to directly optimize relative distances.
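As a starting point for the bonus experiment, here is a minimal sketch of the triplet loss for a single (anchor, positive, negative) triple, written in plain Python. The margin of 0.5 and the 2-dimensional vectors are assumptions for illustration only; a real experiment would compute this on the model's embeddings inside the training loop.

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge on (distance to positive) - (distance to negative) + margin."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)

# Hypothetical embeddings.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
near_negative = [0.8, 0.3]   # too close to the anchor: still penalized
far_negative = [0.0, 1.0]    # already far enough: no penalty

loss_near = triplet_loss(anchor, positive, near_negative)
loss_far = triplet_loss(anchor, positive, far_negative)
print(f"near negative: {loss_near:.3f}")
print(f"far negative:  {loss_far:.3f}")
```

Unlike contrastive loss, which scores each pair against a fixed margin, this loss only asks that the negative be farther from the anchor than the positive by at least the margin.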

Practice

1. What is the main purpose of a text embedding model?
easy
A. To convert text into numbers that capture its meaning
B. To translate text from one language to another
C. To generate images from text descriptions
D. To count the number of words in a text

Solution

  1. Step 1: Understand what text embedding models do

    Text embedding models turn words or sentences into number arrays that represent their meaning.
  2. Step 2: Compare options with this understanding

    Only option A, "To convert text into numbers that capture its meaning", describes what an embedding model does. The other options describe different tasks: translation, image generation, and word counting.
  3. Final Answer:

    To convert text into numbers that capture its meaning -> Option A
  4. Quick Check:

    Text embedding = convert text to meaningful numbers [OK]
Hint: Remember: embeddings turn text into numbers for meaning [OK]
Common Mistakes:
  • Confusing embeddings with translation
  • Thinking embeddings generate images
  • Assuming embeddings just count words
2. Which of the following is the correct way to get an embedding vector from a text using a Python function get_embedding(text)?
easy
A. embedding = get_embedding->text
B. embedding = get_embedding[text]
C. embedding = get_embedding{text}
D. embedding = get_embedding(text)

Solution

  1. Step 1: Recall Python function call syntax

    In Python, functions are called with parentheses and arguments inside, like func(arg).
  2. Step 2: Match syntax with options

    Only embedding = get_embedding(text) is a function call. Options A and C are syntax errors, and option B is subscripting, which raises a TypeError when applied to a function.
  3. Final Answer:

    embedding = get_embedding(text) -> Option D
  4. Quick Check:

    Function call uses parentheses () [OK]
Hint: Use parentheses () to call functions in Python [OK]
Common Mistakes:
  • Using square brackets [] instead of parentheses
  • Using curly braces {} instead of parentheses
  • Using arrow -> instead of parentheses
3. Given the code below, what will be the output?
def dummy_embedding(text):
    return [len(text), sum(ord(c) for c in text) % 100]

result = dummy_embedding('cat')
print(result)
medium
A. [3, 12]
B. [3, 15]
C. [4, 30]
D. [3, 30]

Solution

  1. Step 1: Calculate length of 'cat'

    The word 'cat' has 3 characters, so first element is 3.
  2. Step 2: Calculate sum of ASCII codes modulo 100

    ord('c')=99, ord('a')=97, ord('t')=116; sum=99+97+116=312; 312 % 100 = 12.
  3. Step 3: Determine output

    return [3, 12], so print([3, 12]).
  4. Final Answer:

    [3, 12] -> Option A
  5. Quick Check:

    len('cat')=3, (99+97+116)%100=12 [OK]
Hint: Calculate length and ASCII sum mod 100 carefully [OK]
Common Mistakes:
  • Wrong ASCII sum calculation
  • Miscounting string length
  • Mixing uppercase and lowercase ASCII codes
4. The following code tries to get embeddings for two texts but doesn't work as intended. What is the problem?
def get_embedding(text):
    return [len(text)]

texts = ['hello', 'world']
embeddings = []
for t in texts:
    embeddings.append(get_embedding)
print(embeddings)
medium
A. The list texts is empty
B. The function is not called; it appends the function itself
C. The variable embeddings is not defined
D. The function get_embedding has wrong syntax

Solution

  1. Step 1: Check the loop appending embeddings

    The code appends get_embedding without parentheses, so it adds the function object, not the result.
  2. Step 2: Understand the problem

    Appending the function itself causes the list to hold function references, not embedding lists like [5] and [5].
  3. Final Answer:

    The function is not called; it appends the function itself -> Option B
  4. Quick Check:

    Without (), the function object itself is appended instead of its result [OK]
Hint: Add () to call function, not just reference it [OK]
Common Mistakes:
  • Forgetting parentheses to call function
  • Assuming list is empty causes error
  • Thinking variable is undefined
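For comparison, a corrected version of the loop from question 4 calls the function on each text, so the list holds the returned values rather than function objects:

```python
def get_embedding(text):
    return [len(text)]

texts = ['hello', 'world']
embeddings = []
for t in texts:
    # Parentheses call the function with t and append its return value.
    embeddings.append(get_embedding(t))
print(embeddings)  # prints [[5], [5]]
```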
5. You want to find the most similar sentence to 'I love apples' from a list using embeddings. Which approach is best?
hard
A. Count common words between 'I love apples' and each sentence
B. Translate all sentences to another language and compare lengths
C. Compute embeddings for all sentences, then find the one with smallest distance to 'I love apples' embedding
D. Randomly pick a sentence from the list

Solution

  1. Step 1: Understand similarity with embeddings

    Embeddings turn sentences into number arrays capturing meaning, so comparing distances between embeddings finds similar sentences.
  2. Step 2: Evaluate options for similarity search

    Option C computes embeddings for all sentences and picks the one with the smallest distance to the 'I love apples' embedding, which is the correct method. Options A, B, and D do not use embeddings or any meaningful similarity measure.
  3. Final Answer:

    Compute embeddings for all sentences, then find the one with smallest distance to 'I love apples' embedding -> Option C
  4. Quick Check:

    Use embeddings + distance for similarity [OK]
Hint: Use embedding distances to find similar texts [OK]
Common Mistakes:
  • Using word count instead of embeddings
  • Ignoring embeddings for similarity
  • Random selection instead of comparison
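The approach in option C can be sketched in a few lines. This example assumes the sentence embeddings have already been computed; the 3-dimensional vectors, the candidate sentences, and the most_similar helper are hypothetical and stand in for the output of a real embedding model.

```python
import math

# Hypothetical precomputed embeddings for the candidate sentences.
candidates = {
    "I enjoy eating fruit": [0.8, 0.3, 0.1],
    "The stock market fell": [0.1, 0.2, 0.9],
    "Apples are my favorite": [0.9, 0.2, 0.1],
}
# Hypothetical embedding of the query 'I love apples'.
query_embedding = [0.9, 0.1, 0.1]

def most_similar(query_vec, embedded_sentences):
    """Return the sentence whose embedding is closest to query_vec."""
    return min(
        embedded_sentences,
        key=lambda s: math.dist(query_vec, embedded_sentences[s]),
    )

best = most_similar(query_embedding, candidates)
print(best)
```

Euclidean distance is used here for brevity; if the embeddings are normalized to unit length, ranking by smallest Euclidean distance gives the same order as ranking by highest cosine similarity.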