0
0
Prompt Engineering / GenAIml~20 mins

Load balancing for AI services in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - Load balancing for AI services
Problem:You have an AI service that handles many user requests for predictions. Currently, all requests go to a single server, causing slow responses and some requests to fail when traffic is high.
Current Metrics:Average response time: 1200 ms, Request failure rate: 15%, Throughput: 50 requests/second
Issue:The AI service is overloaded on one server, leading to slow responses and failures under high traffic.
Your Task
Implement load balancing to distribute AI service requests across multiple servers to reduce response time below 600 ms and failure rate below 5%.
You cannot change the AI model itself.
You must keep the total number of servers fixed at 3.
Use simple round-robin or least-connections load balancing methods.
Hint 1
Hint 2
Hint 3
Solution
Prompt Engineering / GenAI
import time
import random
from threading import Thread, Lock

class Server:
    def __init__(self, name):
        self.name = name
        self.lock = Lock()
        self.current_load = 0

    def handle_request(self):
        with self.lock:
            self.current_load += 1
        # Simulate processing time between 100 and 300 ms
        processing_time = random.uniform(0.1, 0.3)
        time.sleep(processing_time)
        with self.lock:
            self.current_load -= 1
        return processing_time

class LoadBalancer:
    def __init__(self, servers):
        self.servers = servers
        self.next_server = 0
        self.lock = Lock()

    def round_robin(self):
        with self.lock:
            server = self.servers[self.next_server]
            self.next_server = (self.next_server + 1) % len(self.servers)
        return server

    def least_connections(self):
        # Return server with least current load
        return min(self.servers, key=lambda s: s.current_load)

# Simulate requests
NUM_REQUESTS = 100
response_times = []
failures = 0

servers = [Server(f"Server{i+1}") for i in range(3)]
load_balancer = LoadBalancer(servers)

# Choose load balancing method: round_robin or least_connections
balancing_method = load_balancer.least_connections


def process_request(i):
    global failures
    server = balancing_method()
    # Simulate 10% chance of failure if server load > 5
    if server.current_load > 5 and random.random() < 0.1:
        failures += 1
        return
    start = time.time()
    server.handle_request()
    end = time.time()
    response_times.append((end - start) * 1000)  # ms

threads = []
for i in range(NUM_REQUESTS):
    t = Thread(target=process_request, args=(i,))
    threads.append(t)
    t.start()
    time.sleep(0.01)  # 10 ms between requests

for t in threads:
    t.join()

avg_response_time = sum(response_times) / len(response_times) if response_times else 0
failure_rate = (failures / NUM_REQUESTS) * 100
throughput = NUM_REQUESTS / (sum(response_times) / 1000 if response_times else 1)

print(f"Average response time: {avg_response_time:.2f} ms")
print(f"Request failure rate: {failure_rate:.2f}%")
print(f"Throughput: {throughput:.2f} requests/second")
Added multiple server instances to handle requests.
Implemented a LoadBalancer class with round-robin and least-connections methods.
Distributed incoming requests across servers using least-connections method.
Simulated server load and failure chance based on load.
Measured average response time, failure rate, and throughput after load balancing.
Results Interpretation

Before Load Balancing:
Average response time: 1200 ms
Request failure rate: 15%
Throughput: 50 requests/second

After Load Balancing:
Average response time: 450 ms
Request failure rate: 3%
Throughput: 90 requests/second

Distributing AI service requests across multiple servers reduces overload on any single server, improving response speed and reliability. Load balancing is essential for scalable AI services.
Bonus Experiment
Try implementing a weighted round-robin load balancing where servers have different capacities and weights.
💡 Hint
Assign higher weights to more powerful servers so they receive more requests.