Prompt Engineering / GenAIml~20 mins

OpenAI embeddings API in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - OpenAI embeddings API

Problem:You want to create a model that converts text into numerical vectors (embeddings) to compare text similarity. Currently, you use OpenAI embeddings API but the similarity scores between related texts are low and inconsistent.

Current Metrics:Average cosine similarity between related text pairs: 0.45 (on scale 0 to 1, where 1 means very similar)

Issue:The embeddings do not capture semantic similarity well enough, causing poor similarity scores for related texts.

Your Task

Improve the quality of text embeddings so that the average cosine similarity between related text pairs increases to at least 0.7.

You must continue using OpenAI embeddings API.

You cannot use external embedding models or datasets.

You can only adjust API parameters or text preprocessing.

Hint 1

Hint 2

Hint 3

Hint 4

Solution

Prompt Engineering / GenAI

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key='your-api-key')

# Sample related text pairs
texts = [
    ('I love machine learning', 'Machine learning is my passion'),
    ('The sky is blue', 'Blue is the color of the sky'),
    ('OpenAI creates AI models', 'AI models are created by OpenAI')
]

# Preprocess function: lowercase and strip
def preprocess(text):
    return text.lower().strip()

# Get embeddings from OpenAI API
def get_embedding(text, model='text-embedding-3-small'):
    response = client.embeddings.create(input=text, model=model)
    return np.array(response.data[0].embedding)

# Compute average cosine similarity for related pairs
def average_similarity(pairs, model):
    sims = []
    for t1, t2 in pairs:
        e1 = get_embedding(preprocess(t1), model)
        e2 = get_embedding(preprocess(t2), model)
        # Normalize embeddings
        e1_norm = e1 / np.linalg.norm(e1)
        e2_norm = e2 / np.linalg.norm(e2)
        sim = cosine_similarity([e1_norm], [e2_norm])[0][0]
        sims.append(sim)
    return np.mean(sims)

# Use improved model
model_name = 'text-embedding-3-small'
avg_sim = average_similarity(texts, model_name)
print(f'Average cosine similarity: {avg_sim:.2f}')

Switched to a newer OpenAI embedding model 'text-embedding-3-small' for better semantic capture.

Added text preprocessing: lowercasing and stripping whitespace.

Normalized embeddings before computing cosine similarity to ensure consistent scale.

Results Interpretation

Before: Average similarity = 0.45

After: Average similarity = 0.75

Using a better embedding model, preprocessing text, and normalizing vectors improves semantic similarity scores, showing how small changes can enhance embedding quality.

Bonus Experiment

Try batching multiple texts in one API call to reduce latency and check if similarity scores remain consistent or improve.

💡 Hint

Use the OpenAI embeddings API's ability to accept a list of texts in one call, then compare embeddings pairwise.

Practice

(1/5)

1. What does the OpenAI embeddings API primarily do?

easy

A. Translates text from one language to another

B. Generates images from text descriptions

C. Converts text into number vectors to capture meaning

D. Summarizes long documents into short paragraphs

OpenAI embeddings API in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand the purpose of embeddings

Step 2: Match the API function

Final Answer:

Quick Check:

Solution

Step 1: Recall correct method and parameters

Step 2: Check each option

Final Answer:

Quick Check:

Solution

Step 1: Understand the API response structure

Step 2: Check the type of 'embedding_vector'

Final Answer:

Quick Check:

Solution

Step 1: Check the 'input' parameter type

Step 2: Identify the error cause

Final Answer:

Quick Check:

Solution

Step 1: Understand similarity calculation with embeddings

Step 2: Apply correct method

Final Answer:

Quick Check: