
Traffic management (routing, splitting) in Microservices - System Design Exercise

Design: Microservices Traffic Management System
This design focuses on the traffic routing and splitting layer in front of microservices. It does not cover microservice internal logic, database design, or client-side load balancing.
Functional Requirements
FR1: Route incoming client requests to appropriate microservice instances based on service version or environment.
FR2: Support traffic splitting to gradually shift traffic between different service versions (e.g., canary releases).
FR3: Allow dynamic configuration of routing rules without downtime.
FR4: Provide observability for traffic distribution and routing decisions.
Non-Functional Requirements
NFR1: System must support zero-downtime updates to routing rules.
NFR2: Routing must add minimal latency (p99 < 50 ms for the routing decision) to avoid user impact.
NFR3: High availability (99.9%) with failover for routing components.
NFR4: Scalable to a baseline of at least 10,000 requests per second, with traffic spikes up to 50,000 requests per second.
NFR5: Security: only authorized operators can change routing rules.
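The gradual traffic shifting in FR2 is often implemented with deterministic hashing so a given user stays pinned to one version as percentages change, rather than flipping between versions per request. A minimal sketch (function and parameter names are illustrative, not from the source):

```python
import hashlib

def pick_version(user_id: str, splits: dict) -> str:
    """Deterministically map a user to a version according to traffic splits.

    splits maps version name -> fraction of traffic (fractions sum to 1.0).
    Hashing the user id keeps each user on one version for the lifetime of
    a given split, avoiding a mixed experience mid-session.
    """
    # Hash the user id to a stable number in [0, 1).
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for version, fraction in splits.items():
        cumulative += fraction
        if bucket < cumulative:
            return version
    return version  # fall through on floating-point rounding

# Example: a 90/10 canary split
print(pick_version("user-42", {"v1": 0.9, "v2": 0.1}))
```

Shifting the canary from 10% to 20% only moves users whose hash bucket falls in the newly added range; everyone else keeps their assignment.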
Think Before You Design
Questions to Ask
❓ What criteria drive routing decisions: headers, user segments, geography, or only service version?
❓ How quickly must a routing-rule change take effect after an operator updates it?
❓ During a canary, should a given user stay pinned to one version, or is per-request splitting acceptable?
❓ What is the baseline request rate, and how large are the expected spikes?
❓ Is the router deployed centrally (gateway) or as a sidecar next to each service?
❓ Who is allowed to change routing rules, and how are changes audited?
Key Components
API Gateway or Ingress Controller
Service Registry or Discovery
Configuration Management System
Traffic Router or Proxy
Monitoring and Logging System
Authentication and Authorization for config changes
Design Patterns
Canary Deployment
Blue-Green Deployment
Feature Flags
Circuit Breaker
Sidecar Proxy Pattern
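Of the patterns above, the Circuit Breaker is the one routers typically implement inline: stop forwarding to an instance after repeated failures, then probe it again later. A minimal sketch (threshold and reset values are illustrative):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `threshold` consecutive
    failures, rejects calls while open, and half-opens (allows one
    trial call) once `reset_s` seconds have elapsed."""

    def __init__(self, threshold=5, reset_s=30.0):
        self.threshold = threshold
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_s:
                raise RuntimeError("circuit open; request rejected")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

In practice Envoy and most service meshes provide this as configuration (outlier detection) rather than application code, but the state machine is the same.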
Reference Architecture
API Gateway / Ingress Controller
Traffic Router / Proxy
Microservice Instances (v1, v2, ...)
Monitoring & Logging
Auth Service
Components
API Gateway / Ingress Controller
NGINX, Envoy, or Kong
Entry point for client requests, performs initial routing and security checks.
Traffic Router / Proxy
Envoy Proxy or custom router
Makes routing and traffic splitting decisions based on configured rules.
Configuration Management
Consul, etcd, or custom config service
Stores routing rules and traffic split percentages, supports dynamic updates.
Microservice Instances
Docker containers or Kubernetes pods
Run different versions of microservices to receive routed traffic.
Monitoring & Logging
Prometheus, Grafana, ELK stack
Collects metrics and logs for traffic distribution and routing decisions.
Authentication & Authorization Service
OAuth2 server or RBAC system
Controls who can update routing configurations.
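The authorization check for config changes can be as simple as a role lookup against the UserRole entity from the schema. A hedged sketch (the role table and role name are hypothetical):

```python
# Hypothetical in-memory role table mirroring the UserRole entity.
# In production this would be backed by the RBAC system or OAuth2 scopes.
ROLES = {
    "alice": {"traffic-admin"},
    "bob": {"viewer"},
}

def can_update_routing(user_id: str) -> bool:
    """Only users holding the traffic-admin role may change routing rules."""
    return "traffic-admin" in ROLES.get(user_id, set())
```

Every accepted change should also append an AuditLog entry (user, action, timestamp) so rule changes are traceable.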
Request Flow
1. Client sends request to API Gateway.
2. API Gateway forwards request to Traffic Router.
3. Traffic Router fetches routing rules from Configuration Management.
4. Traffic Router decides target microservice version based on rules and traffic split percentages.
5. Traffic Router forwards request to selected microservice instance.
6. Microservice processes request and sends response back through Traffic Router and API Gateway.
7. Monitoring system collects metrics on routing decisions and traffic distribution.
8. Authorized operators update routing rules via Configuration Management secured by Auth Service.
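Step 4 above, the core routing decision, combines two mechanisms: criteria-matched rules win outright, and catch-all rules split the remaining traffic by percentage. A minimal sketch, with illustrative rules shaped like the RoutingRule entity (field values are assumptions):

```python
import random

# Illustrative routing rules: a criteria rule pinning beta users to v2,
# plus a 10/90 percentage split for everyone else.
RULES = [
    {"version": "v2", "criteria": {"user_segment": "beta"}, "traffic_percentage": 100},
    {"version": "v2", "criteria": {}, "traffic_percentage": 10},
    {"version": "v1", "criteria": {}, "traffic_percentage": 90},
]

def route(request: dict) -> str:
    """Pick a target version: criteria-matched rules take precedence;
    otherwise split traffic among catch-all rules by percentage."""
    candidates = []
    for rule in RULES:
        if rule["criteria"] and all(
            request.get(k) == v for k, v in rule["criteria"].items()
        ):
            return rule["version"]  # explicit match, e.g. beta users -> v2
        if not rule["criteria"]:
            candidates.append(rule)
    # Weighted random split among the catch-all rules.
    versions = [r["version"] for r in candidates]
    weights = [r["traffic_percentage"] for r in candidates]
    return random.choices(versions, weights=weights)[0]
```

Real proxies express the same logic declaratively, e.g. Envoy's weighted clusters, but evaluating it inline makes the precedence order explicit.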
Database Schema
Entities:
- RoutingRule: id, service_name, version, criteria (e.g., user segment), traffic_percentage, active_flag
- ServiceInstance: id, service_name, version, endpoint, health_status
- UserRole: id, user_id, role_name
- AuditLog: id, user_id, action, timestamp, details

Relationships:
- RoutingRule linked to ServiceInstance by service_name and version
- UserRole linked to users for authorization
- AuditLog records configuration changes by users
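The entities above can be sketched as typed records; the source lists only field names, so the types here are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class RoutingRule:
    id: int
    service_name: str
    version: str
    criteria: dict = field(default_factory=dict)  # e.g. {"user_segment": "beta"}
    traffic_percentage: int = 0
    active_flag: bool = True

@dataclass
class ServiceInstance:
    id: int
    service_name: str
    version: str
    endpoint: str            # e.g. "http://10.0.0.12:8080"
    health_status: str       # e.g. "healthy" / "unhealthy"

@dataclass
class AuditLog:
    id: int
    user_id: int
    action: str              # e.g. "update_routing_rule"
    timestamp: datetime
    details: str
```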
Scaling Discussion
Bottlenecks
Traffic Router becomes a bottleneck under very high request rates.
Configuration Management latency affects routing decision speed.
Monitoring system overload with high volume of metrics.
API Gateway limits throughput if not horizontally scalable.
Solutions
Deploy multiple Traffic Router instances behind a load balancer for horizontal scaling.
Cache routing rules locally in Traffic Router with TTL to reduce config fetch latency.
Use sampling and aggregation in Monitoring to reduce data volume.
Use scalable API Gateway solutions with autoscaling and rate limiting.
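The second solution, caching routing rules locally with a TTL, can be sketched as a small wrapper around whatever config-store client is in use (the fetch callable is a stand-in for a Consul/etcd read):

```python
import time

class RuleCache:
    """Cache routing rules locally with a TTL so the router avoids a
    config-store round trip on every request. `fetch` is any callable
    returning fresh rules; a Consul or etcd client read would slot in."""

    def __init__(self, fetch, ttl_s=5.0):
        self.fetch = fetch
        self.ttl_s = ttl_s
        self._rules = None
        self._fetched_at = 0.0

    def get(self):
        now = time.monotonic()
        if self._rules is None or now - self._fetched_at > self.ttl_s:
            self._rules = self.fetch()   # refresh from the config store
            self._fetched_at = now
        return self._rules
```

The TTL bounds staleness: a rule change propagates to every router within ttl_s seconds, which is the trade-off against per-request fetch latency. Watch-based invalidation (e.g. etcd watches) can shrink that window further.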
Interview Tips
Time: 10 minutes for requirements and clarifications, 15 minutes for architecture and components, 10 minutes for scaling and trade-offs, 10 minutes for Q&A.
Clarify routing criteria and traffic splitting needs early.
Explain choice of proxy/router technology and dynamic config management.
Discuss how zero downtime config updates are achieved.
Highlight observability and security considerations.
Address scaling bottlenecks with concrete solutions.