Global server load balancing (GSLB) in HLD - System Design Exercise

Design: Global Server Load Balancing (GSLB) System

Scope: This design covers DNS-based global load balancing and health monitoring of data centers. It does not cover load balancing inside a data center or application logic.

Functional Requirements

FR1: Distribute user requests across multiple geographically distributed data centers

FR2: Automatically route users to the closest or best-performing data center

FR3: Provide failover in case a data center becomes unavailable

FR4: Support DNS-based load balancing with low latency

FR5: Handle at least 1 million concurrent users globally

FR6: Ensure p99 DNS resolution latency under 100ms

FR7: Provide 99.9% availability for routing service

Non-Functional Requirements

NFR1: Must work across multiple regions and continents

NFR2: DNS TTL should be configurable but typically low (e.g., 30 seconds)

NFR3: System must handle sudden traffic spikes gracefully

NFR4: Data centers may have different capacities and health status

NFR5: Latency and network conditions vary by user location
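As a rough sizing aid for FR5, FR7, and NFR2, the sketch below turns the stated numbers into a query-rate ceiling and a downtime budget. It assumes the worst case in which every concurrent user triggers one authoritative query per TTL; in practice recursive resolvers answer many users from a single cached record, so the real rate is usually lower. The function names are illustrative only.

```python
def peak_authoritative_qps(concurrent_users: int, ttl_seconds: int) -> float:
    """Upper bound on queries/sec if every user re-resolves once per TTL."""
    return concurrent_users / ttl_seconds


def downtime_budget_minutes(availability: float, period_days: int = 30) -> float:
    """Downtime (in minutes) allowed over the period by an availability target."""
    return (1.0 - availability) * period_days * 24 * 60


if __name__ == "__main__":
    # 1 million users, 30-second TTL (FR5 and NFR2)
    print(round(peak_authoritative_qps(1_000_000, 30)))   # 33333
    # 99.9% availability over a 30-day month (FR7)
    print(round(downtime_budget_minutes(0.999), 1))       # 43.2
```

Halving the TTL doubles this ceiling, which is the trade-off between fast failover and DNS query load.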

Think Before You Design

Key Components

Global DNS servers with authoritative zones

Health check service for data centers

Traffic routing logic (geo-IP, latency measurement)

Configuration management for data center metadata

Monitoring and alerting system

Cache and TTL management

Design Patterns

DNS-based load balancing

Health check and failover pattern

Geo-location routing

Weighted round-robin or latency-based routing

Caching and TTL optimization
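To make the weighted round-robin pattern concrete, here is a minimal sketch of one well-known variant, smooth weighted round-robin, which interleaves picks instead of sending a burst to the heaviest site. The class name and the example weights are assumptions for illustration; the design above does not prescribe a specific algorithm.

```python
from typing import Dict


class SmoothWeightedRoundRobin:
    """Pick names in proportion to their weights, spreading picks evenly."""

    def __init__(self, weights: Dict[str, int]) -> None:
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be non-empty and positive")
        self._weights = dict(weights)
        self._current = {name: 0 for name in weights}
        self._total = sum(weights.values())

    def pick(self) -> str:
        # Every candidate gains its own weight each round.
        for name, weight in self._weights.items():
            self._current[name] += weight
        # The candidate with the highest running score wins the round ...
        chosen = max(self._current, key=self._current.get)
        # ... and pays back the total weight so others get their turn.
        self._current[chosen] -= self._total
        return chosen


if __name__ == "__main__":
    # Hypothetical weights: Region A has 5x the capacity of B and C.
    rr = SmoothWeightedRoundRobin({"Region A": 5, "Region B": 1, "Region C": 1})
    print([rr.pick() for _ in range(7)])
```

Over any window of 7 picks with these weights, Region A is chosen 5 times and the other two regions once each.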

Reference Architecture

                 +----------------------+
                 |  User DNS Resolver   |
                 +----------+-----------+
                            |
                            | DNS Query
                            v
                 +----------------------+       +----------------------+
                 |  Global DNS Servers  |<----->| Health Check Service |
                 | (Authoritative DNS)  |       +----------------------+
                 +----------+-----------+
                            |
         +------------+-----+------+------------+
         |            |            |            |
         v            v            v            v
    Data Center  Data Center  Data Center  Data Center
    (Region A)   (Region B)   (Region C)   (Region D)

Components

Global DNS Servers

Authoritative DNS servers (e.g., BIND, NSD, or a managed cloud DNS service)

Respond to DNS queries with IP addresses of the best data center based on routing logic

Health Check Service

Custom service or monitoring tools (e.g., Prometheus, Nagios)

Continuously monitor health and availability of each data center
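A common way to keep one dropped probe from triggering failover is to require several consecutive failures before marking a data center down, and several consecutive successes before marking it up again. The sketch below shows only that state logic; the class name, field names, and threshold values are assumptions, and the actual probe (HTTP, TCP, or otherwise) is left out.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    """Tracks one data center's health with fall/rise thresholds."""

    fail_threshold: int = 3   # consecutive failed probes before marking down
    rise_threshold: int = 2   # consecutive good probes before marking up
    healthy: bool = True
    _streak: int = 0          # consecutive results that disagree with `healthy`

    def record(self, probe_ok: bool) -> bool:
        """Feed in one probe result and return the resulting health status."""
        if probe_ok == self.healthy:
            # Result agrees with the current state, so any streak is broken.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.rise_threshold if probe_ok else self.fail_threshold
            if self._streak >= needed:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


if __name__ == "__main__":
    state = HealthState()
    for result in (False, False, False, True, True):
        print(result, "->", state.record(result))
```

Higher thresholds reduce flapping but lengthen the time to detect a real outage.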

Routing Logic Module

Custom software or DNS policy engine

Decide which data center IP to return based on geo-location, latency, health, and capacity
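The decision described above can be sketched as a filter followed by a nearest-site choice: drop unhealthy sites, prefer sites with spare capacity, then pick the closest one. This is a simplified illustration that uses great-circle distance as a stand-in for measured latency; the `DataCenter` fields, the coordinates, and the capacity units are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataCenter:
    name: str
    lat: float
    lon: float
    healthy: bool
    capacity: int       # hypothetical unit: requests/sec the site can absorb
    current_load: int   # same unit as capacity


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points on Earth, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def choose_data_center(
    client_lat: float, client_lon: float, sites: List[DataCenter]
) -> Optional[DataCenter]:
    """Return the nearest healthy site, preferring sites with spare capacity."""
    healthy = [s for s in sites if s.healthy]
    if not healthy:
        return None  # caller decides the last-resort behaviour
    with_headroom = [s for s in healthy if s.current_load < s.capacity]
    candidates = with_headroom or healthy  # fall back to full sites if needed
    return min(
        candidates,
        key=lambda s: haversine_km(client_lat, client_lon, s.lat, s.lon),
    )


if __name__ == "__main__":
    sites = [
        DataCenter("Region A", 50.11, 8.68, True, 100, 10),
        DataCenter("Region B", 39.04, -77.49, True, 100, 10),
    ]
    print(choose_data_center(48.86, 2.35, sites).name)
```

A production policy engine would likely replace the distance function with measured latency and blend in the configured weights.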

Configuration Management

Database or config files

Store metadata about data centers, weights, and routing policies

Monitoring and Alerting

Dashboard and alerting tools (e.g., Grafana for dashboards, PagerDuty for paging)

Track system health, DNS latency, and alert on failures

Request Flow

1. User's device sends DNS query to local DNS resolver.

2. On a cache miss, the local (recursive) DNS resolver queries the Global DNS Servers that are authoritative for the domain.

3. Global DNS Servers invoke Routing Logic Module to select the best data center IP.

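4. Routing Logic uses a geo-IP lookup on the resolver's IP address (or on the client subnet, when the resolver sends EDNS Client Subnet), together with health status and latency data, to pick a data center.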

5. Global DNS Servers respond with IP address of selected data center.

6. User connects to the selected data center's IP for service.

7. Health Check Service continuously probes data centers and updates their health status.

8. Routing Logic updates decisions based on health and capacity changes.

9. Monitoring system tracks DNS response times and data center availability.

Database Schema

Entities:

- DataCenter(id, name, region, ip_addresses, capacity, status)

- HealthCheck(id, data_center_id, timestamp, status, latency)

- RoutingPolicy(id, criteria_type, parameters, weight)

Relationships:

- Each DataCenter has many HealthCheck records

- RoutingPolicy defines rules applied to DataCenters for selection
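One possible concrete form of these entities is sketched below using SQLite from Python's standard library. The column types, the comma-separated `ip_addresses` field, the `latency_ms` column name, and the `latest_status` helper are assumptions made for the example; a real deployment might use a different store entirely.

```python
import sqlite3
from typing import Dict

SCHEMA = """
CREATE TABLE data_center (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    region       TEXT NOT NULL,
    ip_addresses TEXT NOT NULL,      -- comma-separated for brevity
    capacity     INTEGER NOT NULL,
    status       TEXT NOT NULL
);
CREATE TABLE health_check (
    id             INTEGER PRIMARY KEY,
    data_center_id INTEGER NOT NULL REFERENCES data_center(id),
    timestamp      INTEGER NOT NULL, -- Unix seconds
    status         TEXT NOT NULL,
    latency_ms     REAL
);
CREATE TABLE routing_policy (
    id            INTEGER PRIMARY KEY,
    criteria_type TEXT NOT NULL,
    parameters    TEXT NOT NULL,     -- e.g. a JSON document
    weight        INTEGER NOT NULL
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)


def latest_status(conn: sqlite3.Connection) -> Dict[str, str]:
    """Map each data center name to the status of its newest health check."""
    rows = conn.execute(
        """
        SELECT dc.name, hc.status
        FROM data_center AS dc
        JOIN health_check AS hc ON hc.data_center_id = dc.id
        WHERE hc.timestamp = (
            SELECT MAX(h2.timestamp)
            FROM health_check AS h2
            WHERE h2.data_center_id = dc.id
        )
        """
    ).fetchall()
    return dict(rows)
```

The one-to-many relationship between DataCenter and HealthCheck shows up as the `data_center_id` foreign key.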

Scaling Discussion

Bottlenecks

Global DNS servers can become overwhelmed by high query volume

Health check service may lag in detecting failures at scale

Geo-IP lookups can add latency if not cached efficiently

DNS caching by clients and ISPs can delay failover

Routing logic complexity can increase latency in DNS responses
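To see how health-check lag and DNS caching combine, the sketch below adds up a worst-case failover delay: the time to detect the outage plus the time cached answers stay valid. The formula is a simplification, and the parameter names, including the optional resolver-imposed minimum TTL, are assumptions for illustration.

```python
def worst_case_failover_seconds(
    probe_interval_s: float,
    fail_threshold: int,
    probe_timeout_s: float,
    ttl_s: float,
    resolver_min_ttl_s: float = 0.0,
) -> float:
    """Rough upper bound on how long clients may keep hitting a failed site."""
    # The site fails right after a good probe; it then takes `fail_threshold`
    # more probes, the last of which may run until its timeout, to mark it down.
    detection = probe_interval_s * fail_threshold + probe_timeout_s
    # Answers handed out just before detection stay cached for the TTL, or
    # longer if a resolver enforces its own minimum TTL.
    effective_ttl = max(ttl_s, resolver_min_ttl_s)
    return detection + effective_ttl


if __name__ == "__main__":
    # 10 s probe interval, 3 failures to mark down, 2 s timeout, 30 s TTL
    print(worst_case_failover_seconds(10, 3, 2, 30))  # 62
```

With these example numbers, detection (32 s) and caching (30 s) contribute about equally, so lowering the TTL alone does not remove the delay.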

Solutions

Deploy multiple anycast DNS servers globally to distribute query load

Use distributed health check agents close to data centers for faster detection

Cache geo-IP results and use efficient lookup libraries

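Set low DNS TTL values, accepting the higher query volume they cause, and where clients support it use mechanisms such as DNS Push Notifications (RFC 8765) so that changes are seen before the TTL expires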

Optimize routing logic with precomputed decisions and caching

Interview Tips

Time (for a 45-minute interview): Spend 10 minutes clarifying requirements and constraints, 20 minutes designing architecture and data flow, 10 minutes discussing scaling and trade-offs, and 5 minutes summarizing.

Explain how DNS-based routing works and why it's suitable for GSLB

Discuss health checks and failover mechanisms

Describe how geo-location and latency influence routing decisions

Mention DNS caching challenges and TTL trade-offs

Highlight scalability strategies like anycast DNS and distributed health checks