Prompt Engineering / GenAIml~20 mins

Load balancing for AI services in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - Load balancing for AI services

Problem:You have an AI service that handles many user requests for predictions. Currently, all requests go to a single server, causing slow responses and some requests to fail when traffic is high.

Current Metrics:Average response time: 1200 ms, Request failure rate: 15%, Throughput: 50 requests/second

Issue:The AI service is overloaded on one server, leading to slow responses and failures under high traffic.

Your Task

Implement load balancing to distribute AI service requests across multiple servers to reduce response time below 600 ms and failure rate below 5%.

You cannot change the AI model itself.

You must keep the total number of servers fixed at 3.

Use simple round-robin or least-connections load balancing methods.

Hint 1

Hint 2

Hint 3

Solution

Prompt Engineering / GenAI

import time
import random
from threading import Thread, Lock

class Server:
    def __init__(self, name):
        self.name = name
        self.lock = Lock()
        self.current_load = 0

    def handle_request(self):
        with self.lock:
            self.current_load += 1
        # Simulate processing time between 100 and 300 ms
        processing_time = random.uniform(0.1, 0.3)
        time.sleep(processing_time)
        with self.lock:
            self.current_load -= 1
        return processing_time

class LoadBalancer:
    def __init__(self, servers):
        self.servers = servers
        self.next_server = 0
        self.lock = Lock()

    def round_robin(self):
        with self.lock:
            server = self.servers[self.next_server]
            self.next_server = (self.next_server + 1) % len(self.servers)
        return server

    def least_connections(self):
        # Return server with least current load
        return min(self.servers, key=lambda s: s.current_load)

# Simulate requests
NUM_REQUESTS = 100
response_times = []
failures = 0

servers = [Server(f"Server{i+1}") for i in range(3)]
load_balancer = LoadBalancer(servers)

# Choose load balancing method: round_robin or least_connections
balancing_method = load_balancer.least_connections


def process_request(i):
    global failures
    server = balancing_method()
    # Simulate 10% chance of failure if server load > 5
    if server.current_load > 5 and random.random() < 0.1:
        failures += 1
        return
    start = time.time()
    server.handle_request()
    end = time.time()
    response_times.append((end - start) * 1000)  # ms

threads = []
for i in range(NUM_REQUESTS):
    t = Thread(target=process_request, args=(i,))
    threads.append(t)
    t.start()
    time.sleep(0.01)  # 10 ms between requests

for t in threads:
    t.join()

avg_response_time = sum(response_times) / len(response_times) if response_times else 0
failure_rate = (failures / NUM_REQUESTS) * 100
throughput = NUM_REQUESTS / (sum(response_times) / 1000 if response_times else 1)

print(f"Average response time: {avg_response_time:.2f} ms")
print(f"Request failure rate: {failure_rate:.2f}%")
print(f"Throughput: {throughput:.2f} requests/second")

Added multiple server instances to handle requests.

Implemented a LoadBalancer class with round-robin and least-connections methods.

Distributed incoming requests across servers using least-connections method.

Simulated server load and failure chance based on load.

Measured average response time, failure rate, and throughput after load balancing.

Results Interpretation

Before Load Balancing:
Average response time: 1200 ms
Request failure rate: 15%
Throughput: 50 requests/second

After Load Balancing:
Average response time: 450 ms
Request failure rate: 3%
Throughput: 90 requests/second

Distributing AI service requests across multiple servers reduces overload on any single server, improving response speed and reliability. Load balancing is essential for scalable AI services.

Bonus Experiment

Try implementing a weighted round-robin load balancing where servers have different capacities and weights.

💡 Hint

Assign higher weights to more powerful servers so they receive more requests.

Practice

(1/5)

1. What is the main purpose of load balancing in AI services?

easy

A. To spread AI requests across multiple servers to keep response times fast

B. To increase the size of AI models automatically

C. To reduce the number of AI users at the same time

D. To store AI data in a single location

Load balancing for AI services in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand load balancing role

Step 2: Identify the benefit

Final Answer:

Quick Check:

Solution

Step 1: Identify simple load balancing methods

Step 2: Check other options

Final Answer:

Quick Check:

Solution

Step 1: Understand the round-robin index calculation

Step 2: Check the printed output for request 4

Final Answer:

Quick Check:

Solution

Step 1: Analyze the index calculation for server selection

Step 2: Identify correct operator for cycling

Final Answer:

Quick Check:

Solution

Step 1: Understand the problem of request spikes

Step 2: Evaluate load balancing options

Final Answer:

Quick Check: