
Latency and cost benchmarking in Agentic AI - ML Experiment: Train & Evaluate

Experiment - Latency and cost benchmarking
Problem: You want to measure how fast your AI agent responds and how much it costs to run. You currently have a model that answers questions, but you don't know whether it is fast enough or too expensive.
Current Metrics: Average response latency: 1200 ms, Cost per 1000 requests: $5.00
Issue: The response time is too slow for real-time use and the cost is too high for frequent queries.
Your Task
Reduce the average response latency to under 800 ms and lower the cost per 1000 requests to under $3.50 without losing answer quality.
You cannot change the AI model architecture or training data.
You can only optimize deployment settings and request handling.
Maintain the same accuracy and answer quality.
Solution
import time

# Simulated AI agent with latency and cost tracking.
# All latency and cost numbers are simulated values, not real measurements.
class Agent:
    def __init__(self, base_latency_ms=1200, cost_per_request=0.005, simulate_sleep=False):
        self.base_latency_ms = base_latency_ms
        self.cost_per_request = cost_per_request
        # Set simulate_sleep=True to actually wait; 1000 baseline requests
        # would then take about 20 minutes of wall-clock time.
        self.simulate_sleep = simulate_sleep

    def _wait(self, latency_ms):
        if self.simulate_sleep:
            time.sleep(latency_ms / 1000)

    def respond(self, query):
        # Baseline: no cache, so every request pays full latency and cost
        latency = self.base_latency_ms
        cost = self.cost_per_request
        response = f"Answer to '{query}'"
        self._wait(latency)
        return response, latency, cost

# Benchmark function: average latency (ms) and cost per 1000 requests ($)
def benchmark(agent, queries):
    total_latency = 0
    total_cost = 0
    for q in queries:
        _, latency, cost = agent.respond(q)
        total_latency += latency
        total_cost += cost
    avg_latency = total_latency / len(queries)
    cost_per_1000 = total_cost * (1000 / len(queries))
    return avg_latency, cost_per_1000

# Original agent
original_agent = Agent()
queries = ["What is AI?", "Define machine learning.", "What is AI?", "Explain latency."] * 250

# Benchmark original
orig_latency, orig_cost = benchmark(original_agent, queries)

# Optimized agent with caching and batching
class OptimizedAgent(Agent):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.cache = {}  # maps query -> previously computed response

    def respond_batch(self, batch_queries):
        responses = []
        batch_latency = 0
        batch_cost = 0
        for q in batch_queries:
            if q in self.cache:
                # Cached response is faster and free
                latency = 50
                cost = 0
                response = self.cache[q]
            else:
                latency = 600  # Simulated: per-query latency assumed halved by batching
                cost = 0.003  # Simulated: reduced cost per request
                response = f"Answer to '{q}'"
                self.cache[q] = response
            batch_latency += latency
            batch_cost += cost
            responses.append(response)
        # Simulate batch processing time
        self._wait(batch_latency)
        return responses, batch_latency, batch_cost

    def respond(self, query):
        # A single respond call is a batch of one query
        responses, latency, cost = self.respond_batch([query])
        return responses[0], latency, cost

optimized_agent = OptimizedAgent()
opt_latency, opt_cost = benchmark(optimized_agent, queries)

print(f"Original avg latency: {orig_latency:.1f} ms, cost per 1000: ${orig_cost:.2f}")
print(f"Optimized avg latency: {opt_latency:.1f} ms, cost per 1000: ${opt_cost:.2f}")
Added caching to avoid repeated computation for the same query, reducing latency and cost.
Implemented batch processing to handle multiple queries together, halving latency and reducing cost per request.
Simulated reduced latency and cost values to reflect deployment optimizations.
Results Interpretation

Before Optimization: Latency = 1200 ms, Cost = $5.00 per 1000 requests

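After Optimization: Latency = 650 ms, Cost = $2.80 per 1000 requests (illustrative values; the simulation's actual output depends on the cache hit rate of the query list)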

Using caching and batching can significantly reduce response time and cost without changing the AI model itself. This shows how deployment strategies impact real-world AI performance.
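To see how the averages depend on the workload, the sketch below computes the expected latency and cost for a given cache hit rate. It reuses the simulated values from the solution code (50 ms and $0 for a cache hit, 600 ms and $0.003 for a miss). The function name `expected_metrics` and its parameters are illustrative assumptions, not part of the original solution.

```python
def expected_metrics(hit_rate, hit_latency_ms=50.0, miss_latency_ms=600.0,
                     miss_cost=0.003):
    """Expected average latency (ms) and cost per 1000 requests ($).

    Assumes cache hits are free and every miss pays the full simulated cost.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    avg_latency_ms = hit_rate * hit_latency_ms + miss_rate * miss_latency_ms
    cost_per_1000 = miss_rate * miss_cost * 1000  # only misses are billed
    return avg_latency_ms, cost_per_1000

for rate in (0.0, 0.5, 0.9):
    latency, cost = expected_metrics(rate)
    print(f"hit rate {rate:.0%}: {latency:.0f} ms, ${cost:.2f} per 1000")
```

Under these assumptions, a workload with few repeated queries stays close to the miss values, while a highly repetitive one approaches the cache-hit values.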
Bonus Experiment
Try adding concurrency by sending multiple requests in parallel to further reduce latency.
💡 Hint
Use asynchronous calls or threading to handle multiple queries at the same time and measure the impact on latency and cost.
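One possible starting point for this bonus experiment is sketched below, using a thread pool from the standard library. The function `fake_agent_call` is a hypothetical stand-in for a network-bound agent request, and the 50 ms delay is an assumed value chosen to keep the run short.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_agent_call(query, latency_s=0.05):
    """Hypothetical stand-in for a network-bound agent request."""
    time.sleep(latency_s)  # simulated I/O wait
    return f"Answer to '{query}'"

def timed_run(queries, max_workers):
    """Return (responses, wall-clock seconds) for the whole list of queries."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        responses = list(pool.map(fake_agent_call, queries))
    return responses, time.perf_counter() - start

queries = [f"question {i}" for i in range(8)]
_, sequential_s = timed_run(queries, max_workers=1)
_, parallel_s = timed_run(queries, max_workers=8)
print(f"sequential: {sequential_s:.2f} s, parallel: {parallel_s:.2f} s")
```

Note that in this sketch concurrency shortens the total wall-clock time for the whole set of queries. The latency of each individual request and the cost per request stay the same, so it is worth reporting both numbers separately.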

Practice

1. What does latency measure when benchmarking an AI model?
easy
A. The cost to train the model
B. The amount of memory the model uses
C. The accuracy of the model's predictions
D. The time it takes for the model to respond

Solution

  1. Step 1: Understand latency in AI benchmarking

    Latency refers to how long a model takes to give an answer after receiving input.
  2. Step 2: Differentiate latency from other metrics

    Memory usage, accuracy, and training cost are different metrics; latency is about response time.
  3. Final Answer:

    The time it takes for the model to respond -> Option D
  4. Quick Check:

    Latency = response time [OK]
Hint: Latency means response speed, not memory or cost [OK]
Common Mistakes:
  • Confusing latency with accuracy
  • Thinking latency measures memory use
  • Mixing latency with training cost
2. Which Python code snippet correctly measures latency of a model's prediction function model.predict()?
easy
A. start = time.time(); model.predict(); end = time.time(); latency = end - start
B. latency = model.predict().time()
C. latency = time.predict(model)
D. latency = model.time() - predict.time()

Solution

  1. Step 1: Identify correct timing method in Python

    Using time.time() before and after calling model.predict() measures elapsed time correctly.
  2. Step 2: Check incorrect options for syntax errors

    Options B, C, and D call methods that do not exist (such as .time() on a prediction or time.predict()), so they won't work.
  3. Final Answer:

    start = time.time(); model.predict(); end = time.time(); latency = end - start -> Option A
  4. Quick Check:

    Use time.time() before and after call [OK]
Hint: Use time.time() before and after prediction call [OK]
Common Mistakes:
  • Calling non-existent methods like predict.time()
  • Subtracting wrong attributes
  • Not capturing time before and after prediction
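A single timing is noisy, so in practice it helps to repeat the measurement and summarize it. The sketch below is one way to do that with `time.perf_counter()`, a monotonic clock intended for measuring short durations. `fake_predict` is a hypothetical stand-in for `model.predict()`, and the number of runs is an arbitrary assumption.

```python
import statistics
import time

def measure_latency(fn, runs=20):
    """Call fn() `runs` times and return the per-call latencies in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies

def fake_predict():
    """Hypothetical stand-in for model.predict()."""
    time.sleep(0.01)

latencies = measure_latency(fake_predict)
median_s = statistics.median(latencies)
p95_s = statistics.quantiles(latencies, n=20)[-1]  # last cut point = 95th percentile
print(f"median: {median_s * 1000:.1f} ms, p95: {p95_s * 1000:.1f} ms")
```

Reporting a median together with a high percentile gives a fuller picture than one average, because occasional slow calls are easy to miss otherwise.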
3. Given this code measuring latency and cost, what is the printed output?
import time

start = time.time()
model_response = model.predict(input_data)
end = time.time()
latency = end - start
cost = latency * 0.05  # cost per second
print(round(latency, 2), round(cost, 3))
If model.predict() takes 0.24 seconds, what prints?
medium
A. 0.24 0.012
B. 0.24 0.12
C. 0.24 0.0012
D. 0.24 0.024

Solution

  1. Step 1: Calculate latency and cost

    Latency is 0.24 seconds. Cost = latency * 0.05 = 0.24 * 0.05 = 0.012.
  2. Step 2: Round values as printed

    Latency rounded to 2 decimals is 0.24. Cost rounded to 3 decimals is 0.012.
  3. Final Answer:

    0.24 0.012 -> Option A
  4. Quick Check:

    Cost = latency * 0.05 = 0.012 [OK]
Hint: Multiply latency by cost rate, then round [OK]
Common Mistakes:
  • Multiplying cost by 10 or 100 by mistake
  • Rounding cost incorrectly
  • Confusing latency and cost values
4. This code is reported to measure latency incorrectly. What, if anything, is the bug?
import time
start = time.time()
model.predict(input_data)
latency = time.time() - start
print('Latency:', latency)
medium
A. The model.predict call is missing parentheses
B. The code does not import the model
C. Latency is measured correctly; no bug
D. Latency should be measured before calling model.predict

Solution

  1. Step 1: Check timing logic

    The code records time before and after model.predict(input_data), then subtracts to get latency.
  2. Step 2: Verify correctness of measurement

    This is the correct way to measure latency: the parentheses are present, and the end time is taken after the call returns.
  3. Final Answer:

    Latency is measured correctly; no bug -> Option C
  4. Quick Check:

    Start time before, end time after call [OK]
Hint: Latency = end time minus start time around call [OK]
Common Mistakes:
  • Measuring time before call only
  • Forgetting parentheses on function call
  • Measuring latency after print statement
5. You want to compare two AI models for latency and cost. Model A takes 0.3 seconds per prediction and costs $0.04 per second. Model B takes 0.25 seconds but costs $0.06 per second. Which model is cheaper per prediction and which is faster?
hard
A. Model A is cheaper and faster; Model B is slower and more expensive
B. Model A is cheaper and slower; Model B is faster and more expensive
C. Model B is cheaper and slower; Model A is faster and more expensive
D. Model B is cheaper and faster; Model A is slower and more expensive

Solution

  1. Step 1: Calculate cost per prediction for each model

    Model A cost = 0.3 * 0.04 = $0.012; Model B cost = 0.25 * 0.06 = $0.015.
  2. Step 2: Compare latency and cost

    Model A is cheaper ($0.012 < $0.015) but slower (0.3s > 0.25s). Model B is faster but more expensive.
  3. Final Answer:

    Model A is cheaper and slower; Model B is faster and more expensive -> Option B
  4. Quick Check:

    Cost = latency * rate; compare values [OK]
Hint: Multiply latency by cost rate to compare total cost [OK]
Common Mistakes:
  • Ignoring cost per second rate
  • Mixing up which model is faster
  • Calculating cost incorrectly
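The comparison in question 5 can also be checked in code. This is a minimal sketch: the `ModelProfile` class and its field names are illustrative assumptions, and the numbers are the ones given in the question.

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    latency_s: float        # seconds per prediction
    cost_per_second: float  # dollars per second of compute

    @property
    def cost_per_prediction(self) -> float:
        # Cost per prediction = latency * cost rate
        return self.latency_s * self.cost_per_second

model_a = ModelProfile("Model A", latency_s=0.3, cost_per_second=0.04)
model_b = ModelProfile("Model B", latency_s=0.25, cost_per_second=0.06)

cheaper = min((model_a, model_b), key=lambda m: m.cost_per_prediction)
faster = min((model_a, model_b), key=lambda m: m.latency_s)
print(f"cheaper: {cheaper.name} (${cheaper.cost_per_prediction:.3f} per prediction)")
print(f"faster: {faster.name} ({faster.latency_s} s)")
```

This reports Model A as cheaper ($0.012 per prediction) and Model B as faster (0.25 s), matching Option B.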