## gRPC for Internal Services in HLD - Scalability & System Analysis

| Scale | Users / Services | Traffic Characteristics | Infrastructure Changes | Latency & Throughput |
|---|---|---|---|---|
| 100 users | 5-10 internal services | Low QPS, mostly request-response | Single server or small cluster, simple load balancing | Low latency (~ms), throughput easily handled |
| 10K users | 20-50 internal services | Moderate QPS, some streaming calls | Multiple servers, load balancers, basic service discovery | Latency remains low, throughput increases, some resource contention |
| 1M users | 100+ internal services | High QPS, mix of unary and streaming, complex call graphs | Horizontal scaling, advanced service mesh, distributed tracing | Latency sensitive, throughput near server limits, network bottlenecks appear |
| 100M users | 1000+ internal services | Very high QPS, heavy streaming, multi-region calls | Global clusters, sharded services, aggressive caching, CDN for static data | Latency optimization critical, throughput requires sharding and partitioning |
At small to medium scale, the first bottleneck is application-server CPU and network bandwidth. gRPC uses HTTP/2, which multiplexes many streams over a single connection, but high QPS and long-lived streaming calls can still saturate server CPU and NIC capacity.
At larger scale, service discovery and load balancing become bottlenecks as the number of services and call volume grow. Cross-region calls add another limit: network latency and bandwidth between regions constrain end-to-end performance.
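To make the load-balancing point concrete, here is a minimal sketch of what a client-side round-robin picker does per call. The class and addresses are hypothetical; a real gRPC setup would use the library's built-in `round_robin` policy together with a resolver fed by service discovery.

```python
import itertools

class RoundRobinPicker:
    """Cycles through a static list of backend addresses.

    Hypothetical stand-in for a gRPC client-side balancer;
    real balancers also track connectivity and health.
    """
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        # Each call is routed to the next backend in turn.
        return next(self._cycle)

picker = RoundRobinPicker(["10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051"])
calls = [picker.pick() for _ in range(6)]  # each backend picked twice
```

At scale the picker itself is cheap; the hard part is keeping the backend list fresh, which is why service discovery becomes the bottleneck before the balancing logic does.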
- Horizontal scaling: Add more instances of services behind load balancers to distribute load.
- Service mesh: Use tools like Istio or Linkerd for advanced routing, retries, and observability.
- Caching: Cache frequent responses to reduce load on services.
- Connection pooling: Reuse gRPC connections to reduce overhead.
- Sharding: Partition services or data to reduce load per instance.
- Compression: Enable gRPC message compression to reduce network usage.
- Multi-region deployment: Deploy services closer to users to reduce latency.
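The sharding item above can be sketched in a few lines. This is a deliberately simple hash-mod scheme with hypothetical names; production systems often prefer consistent hashing so that adding a shard remaps only a fraction of keys.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Deterministically map a key to a shard index.

    Simple hash-mod sketch: the same key always lands on the
    same shard, spreading load roughly evenly across shards.
    """
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# A user's requests always hit the same shard:
shard = shard_for("user:42", 8)
```

Routing by a stable key (user ID, tenant ID) also keeps per-shard caches warm, which compounds with the caching item above.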
Assuming 1M users generating 10 QPS each internally (10M QPS total):
- Assume each server sustains ~3,000 gRPC requests per second.
- Need ~3,334 servers to handle 10M QPS (10M / 3,000 ≈ 3,333.3, rounded up).
- Network bandwidth per server: at an average message size of 10 KB, 3,000 req/s × 10 KB ≈ 30 MB/s (~240 Mbps), well within a 1 Gbps NIC.
- Storage is dominated by logging and tracing; distributed-tracing data grows quickly and typically needs dedicated storage (and sampling).
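The estimate above can be checked with a few lines of arithmetic, using the same assumptions (1M users × 10 QPS, ~3,000 QPS per server, 10 KB average message, decimal units):

```python
import math

users = 1_000_000
qps_per_user = 10
total_qps = users * qps_per_user                 # 10,000,000 QPS

qps_per_server = 3_000
servers = math.ceil(total_qps / qps_per_server)  # 3,334 (the doc rounds to ~3,333)

avg_msg_bytes = 10 * 1000                        # 10 KB, decimal
bytes_per_s = qps_per_server * avg_msg_bytes     # 30 MB/s per server
mbps = bytes_per_s * 8 / 1_000_000               # 240 Mbps, under a 1 Gbps NIC
```

Writing the estimate as code makes the units explicit, which is where back-of-envelope numbers most often go wrong (bits vs bytes, KB vs KiB).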
Start by clarifying the scale and traffic patterns. Identify the first bottleneck based on expected QPS and message sizes. Discuss horizontal scaling and service mesh for routing and observability. Mention connection reuse and caching to optimize performance. Finally, consider multi-region deployment for latency-sensitive services.
Your database handles 1000 QPS. Traffic grows 10x. What do you do first?
Answer: First add caching and read replicas to absorb read traffic and protect the primary; only after the read path is offloaded consider sharding writes or scaling application servers.
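The first step in that answer, the cache-aside read path, can be sketched as follows. All names here are hypothetical; a real deployment would put Redis or Memcached in front of the replicas and add TTLs and invalidation.

```python
cache = {}  # stand-in for an external cache such as Redis

def db_read(key):
    """Stand-in for a query against a read replica."""
    return f"row-for-{key}"

def get(key):
    if key in cache:        # cache hit: the database is never touched
        return cache[key]
    value = db_read(key)    # cache miss: fall through to the replica
    cache[key] = value      # populate so subsequent reads are hits
    return value

first = get("user:7")   # miss - populates the cache
second = get("user:7")  # hit  - served from memory
```

With a reasonable hit rate, most of the 10x read growth never reaches the database, which is why caching comes before scaling the servers behind it.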