Experiment - Latency and cost benchmarking

Problem: You want to measure how fast your AI agent responds and how much it costs to run. You currently have a model that answers questions, but you don't know whether it is fast enough or too expensive.

Current Metrics: Average response latency: 1200 ms; cost per 1000 requests: $5.00

Issue: The response time is too slow for real-time use, and the cost is too high for frequent queries.

Your Task

Reduce the average response latency to under 800 ms and lower the cost per 1000 requests to under $3.50 without losing answer quality.

You cannot change the AI model architecture or training data.

You can only optimize deployment settings and request handling.

Maintain the same accuracy and answer quality.
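
The targets above translate into required reductions of about a third in latency and 30% in cost. A minimal sketch of that arithmetic follows; the helper names `required_reduction` and `meets_targets` are hypothetical and not part of the exercise:

```python
# Baseline figures and targets taken from the task statement.
BASELINE_LATENCY_MS = 1200
BASELINE_COST_PER_1000 = 5.00
TARGET_LATENCY_MS = 800
TARGET_COST_PER_1000 = 3.50


def required_reduction(baseline, target):
    """Fraction by which `baseline` must shrink to reach `target`."""
    return (baseline - target) / baseline


def meets_targets(avg_latency_ms, cost_per_1000):
    """Hypothetical helper: True only if both values are strictly under the targets."""
    return (avg_latency_ms < TARGET_LATENCY_MS
            and cost_per_1000 < TARGET_COST_PER_1000)


latency_cut = required_reduction(BASELINE_LATENCY_MS, TARGET_LATENCY_MS)
cost_cut = required_reduction(BASELINE_COST_PER_1000, TARGET_COST_PER_1000)
print(f"Latency must drop by more than {latency_cut:.1%}")
print(f"Cost must drop by more than {cost_cut:.1%}")
print("Baseline meets targets:", meets_targets(BASELINE_LATENCY_MS, BASELINE_COST_PER_1000))
```

A check like `meets_targets` can be applied to whatever a benchmark run returns, so the pass/fail decision stays separate from the measurement itself.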

Solution

Agentic AI

import time

# Simulated AI agent that reports a latency (ms) and a cost (USD) per request
class Agent:
    def __init__(self, base_latency_ms=1200, cost_per_request=0.005,
                 use_cache=False, simulate_delay=False):
        self.base_latency_ms = base_latency_ms
        self.cost_per_request = cost_per_request
        self.use_cache = use_cache            # the baseline runs without a cache
        self.simulate_delay = simulate_delay  # True: actually sleep for each latency
        self.cache = {}

    def _wait(self, latency_ms):
        # Sleeping is optional: the metrics come from the returned values, not the clock
        if self.simulate_delay:
            time.sleep(latency_ms / 1000)

    def respond(self, query):
        if self.use_cache and query in self.cache:
            # Cached response is faster and free
            latency = 50
            cost = 0
            response = self.cache[query]
        else:
            # Simulate full model processing
            latency = self.base_latency_ms
            cost = self.cost_per_request
            response = f"Answer to '{query}'"
            if self.use_cache:
                self.cache[query] = response
        self._wait(latency)
        return response, latency, cost

# Benchmark function: average latency (ms) and cost per 1000 requests
def benchmark(agent, queries):
    if not queries:
        raise ValueError("queries must not be empty")
    total_latency = 0
    total_cost = 0
    for q in queries:
        _, latency, cost = agent.respond(q)
        total_latency += latency
        total_cost += cost
    avg_latency = total_latency / len(queries)
    cost_per_1000 = total_cost * (1000 / len(queries))
    return avg_latency, cost_per_1000

# Baseline agent: no caching, full latency and cost on every request
original_agent = Agent()
queries = ["What is AI?", "Define machine learning.", "What is AI?", "Explain latency."] * 250

# Benchmark original
orig_latency, orig_cost = benchmark(original_agent, queries)

# Optimized agent with caching and a batch interface
class OptimizedAgent(Agent):
    def __init__(self, base_latency_ms=600, cost_per_request=0.003, simulate_delay=False):
        # 600 ms and $0.003 are assumed values standing in for the effect of batching
        super().__init__(base_latency_ms, cost_per_request,
                         use_cache=True, simulate_delay=simulate_delay)

    def respond_batch(self, batch_queries):
        responses = []
        batch_latency = 0
        batch_cost = 0
        for q in batch_queries:
            if q in self.cache:
                latency = 50
                cost = 0
                response = self.cache[q]
            else:
                latency = self.base_latency_ms  # assumed halved by batching
                cost = self.cost_per_request    # assumed reduced cost per request
                response = f"Answer to '{q}'"
                self.cache[q] = response
            batch_latency += latency
            batch_cost += cost
            responses.append(response)
        # Simulate batch processing time
        self._wait(batch_latency)
        return responses, batch_latency, batch_cost

    def respond(self, query):
        # Single respond calls use a batch containing one query
        responses, latency, cost = self.respond_batch([query])
        return responses[0], latency, cost

optimized_agent = OptimizedAgent()
opt_latency, opt_cost = benchmark(optimized_agent, queries)

print(f"Original avg latency: {orig_latency:.2f} ms, cost per 1000: ${orig_cost:.2f}")
print(f"Optimized avg latency: {opt_latency:.2f} ms, cost per 1000: ${opt_cost:.2f}")

Added caching to avoid repeated computation for the same query, reducing latency and cost.

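Added a batch interface (respond_batch) for handling multiple queries together. The simulation assumes this halves the uncached latency to 600 ms and lowers the cost to $0.003 per request.

The reduced latency and cost values are hard-coded assumptions that stand in for real deployment optimizations; they are not measurements.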
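
The caching idea can also be expressed with the standard library's `functools.lru_cache`. The sketch below is an illustration only: the `answer` function and the `calls` counter are hypothetical stand-ins for a real model call, which would replace the function body.

```python
from functools import lru_cache

calls = {"model": 0}  # counts how often the simulated model actually runs


@lru_cache(maxsize=1024)
def answer(query: str) -> str:
    """Stand-in for an expensive model call (hypothetical)."""
    calls["model"] += 1
    return f"Answer to '{query}'"


queries = ["What is AI?", "Define machine learning.", "What is AI?", "Explain latency."]
responses = [answer(q) for q in queries]

info = answer.cache_info()
print(f"model calls: {calls['model']}, cache hits: {info.hits}, misses: {info.misses}")
```

Exact-match caching of this kind only helps when the same query string arrives more than once; differently worded questions with the same meaning still count as misses.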

Results Interpretation

Before Optimization: Latency = 1200 ms, Cost = $5.00 per 1000 requests

After Optimization: Latency = 650 ms, Cost = $2.80 per 1000 requests

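Caching and batching can significantly reduce response time and cost without changing the AI model itself, which shows how much deployment strategy affects real-world AI performance. Treat the after-optimization figures as illustrative: in the simulation, the result depends heavily on the cache hit rate. The script's query list contains only three unique queries among 1000 requests, so almost every call is a cache hit and the optimized averages it prints come out far lower than the figures above. A workload with fewer repeated queries would land closer to the uncached values of 600 ms and $3.00 per 1000 requests.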
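
How the averages move with the cache hit rate can be sketched with a small model. It uses the simulated values from the script (a hit takes 50 ms and is free, a miss takes 600 ms and costs $0.003); the function name `expected_metrics` is hypothetical.

```python
def expected_metrics(hit_rate, hit_latency_ms=50, miss_latency_ms=600,
                     miss_cost=0.003):
    """Average latency (ms) and cost per 1000 requests for a given cache hit rate.

    Defaults are the simulated values from the script; cache hits are free.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    avg_latency = hit_rate * hit_latency_ms + (1 - hit_rate) * miss_latency_ms
    cost_per_1000 = (1 - hit_rate) * miss_cost * 1000
    return avg_latency, cost_per_1000


# 0.997 is the hit rate of the script's query list (997 repeats in 1000 requests).
for rate in (0.0, 0.25, 0.5, 0.997):
    latency, cost = expected_metrics(rate)
    print(f"hit rate {rate:.1%}: {latency:.2f} ms, ${cost:.2f} per 1000")
```

Under these assumptions even a hit rate of zero stays below both targets, because the assumed batching values alone (600 ms, $3.00 per 1000) are already under 800 ms and $3.50.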

Bonus Experiment

Try adding concurrency by sending multiple requests in parallel to reduce the total time needed to process a set of queries.

💡 Hint

Use asynchronous calls or threading to handle multiple queries at the same time and measure the impact on latency and cost.
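
One possible starting point for the bonus experiment, using `asyncio`. This is a sketch under stated assumptions: the agent call is simulated with `asyncio.sleep`, and the names `respond`, `run_concurrently`, and `max_in_flight` are hypothetical.

```python
import asyncio
import time


async def respond(query: str, latency_ms: int = 600) -> str:
    """Simulated non-blocking agent call; asyncio.sleep stands in for model time."""
    await asyncio.sleep(latency_ms / 1000)
    return f"Answer to '{query}'"


async def run_concurrently(queries, max_in_flight=4, latency_ms=600):
    """Send queries in parallel, with at most `max_in_flight` running at once."""
    limiter = asyncio.Semaphore(max_in_flight)

    async def guarded(query):
        async with limiter:
            return await respond(query, latency_ms)

    start = time.perf_counter()
    # gather returns results in the same order as the input queries
    responses = await asyncio.gather(*(guarded(q) for q in queries))
    elapsed_ms = (time.perf_counter() - start) * 1000
    return responses, elapsed_ms


if __name__ == "__main__":
    demo_queries = [f"Question {i}" for i in range(8)]
    demo_responses, demo_elapsed_ms = asyncio.run(run_concurrently(demo_queries))
    print(f"{len(demo_responses)} responses in {demo_elapsed_ms:.0f} ms wall-clock")
```

In this model each request still takes the same time and costs the same; what concurrency shortens is the wall-clock time for the whole set, so it is worth measuring per-request latency and total elapsed time separately.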