| Metric | 100 users | 10K users | 1M users | 100M users |
|---|---|---|---|---|
| Event Volume | ~1K events/sec | ~100K events/sec | ~10M events/sec | ~1B events/sec |
| Event Broker Load | Single broker instance | Cluster of brokers | Multi-region broker clusters | Global distributed brokers with partitioning |
| Consumer Instances | Few consumers per service | Scaled consumers with load balancing | Auto-scaling consumers with partition assignment | Thousands of consumers with sharding and geo-distribution |
| Data Storage | Local or small DB | Partitioned DB or NoSQL | Sharded DB clusters or distributed storage | Multi-cloud distributed storage with archiving |
| Latency | Low (ms) | Low to moderate (ms to tens of ms) | Moderate (tens to hundreds of ms) | Higher due to geo-distribution (hundreds of ms) |
Event-driven design in LLD - Scalability & System Analysis
As event volume grows, the event broker (message queue) is the first bottleneck. A single broker instance can handle only a limited number of events per second (roughly 10K-100K), so its CPU, memory, and network bandwidth limits are reached before those of consumers or storage.
- Horizontal Scaling: Add more broker instances forming a cluster to distribute event load.
- Partitioning: Split event streams into partitions so consumers can process in parallel.
- Consumer Scaling: Increase number of consumer instances with load balancing and partition assignment.
- Caching: Use caches for frequently accessed event data to reduce storage load.
- Geo-distribution: Deploy brokers and consumers in multiple regions to reduce latency and increase availability.
- Backpressure and Rate Limiting: Control event production rate to avoid overwhelming the system.
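The partitioning and consumer-scaling steps above can be sketched in a few lines. This is a minimal illustration, not a real broker client: the partition count, the consumer names, and the helpers `partition_for` and `assign_partitions` are all hypothetical. It assumes events are routed by a key such as a user ID, and that each partition is owned by exactly one consumer at a time.

```python
import hashlib

NUM_PARTITIONS = 6  # assumed partition count, chosen only for the illustration


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map an event key to a partition using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would route the same key differently after a
    restart.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions round-robin so each partition has exactly one owner."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    assignment: dict[str, list[int]] = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # Three hypothetical consumers share six partitions, two each.
    print(assign_partitions(NUM_PARTITIONS, ["consumer-a", "consumer-b", "consumer-c"]))
    # The same key always lands on the same partition, which preserves
    # per-key ordering while different keys are processed in parallel.
    for user_id in ["user-1", "user-2", "user-1"]:
        print(user_id, "->", partition_for(user_id))
```

Because a partition has a single owner, adding consumers beyond the number of partitions gives no extra parallelism in this scheme; the partition count sets the ceiling for consumer scaling.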
For 10K users generating ~100K events/sec:
- Broker cluster needs to handle 100K events/sec. At ~20-50K events/sec per node, that is at least 2-5 nodes before adding headroom or replication.
- Consumers must scale to process 100K events/sec, possibly 10-20 instances depending on processing time.
- Storage needs depend on event size; for 1KB events, 100K events/sec = ~100MB/sec = ~8.6TB/day.
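- Network bandwidth must support event ingress and egress. A 1 Gbps link carries ~125MB/sec, so ~100MB/sec of ingress alone uses about 80% of it; egress to consumers adds a comparable amount, so multiple links or higher cloud bandwidth are needed.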
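The arithmetic behind these estimates can be reproduced with a short script. It is a rough sketch under stated assumptions: 1KB is treated as 1,000 bytes (which matches the ~100MB/sec figure), the per-node broker range is the 20-50K events/sec quoted above, and the per-consumer rate of 5-10K events/sec is an assumed figure picked so that the result lands in the 10-20 instance range. All function and constant names are illustrative.

```python
import math

EVENTS_PER_SEC = 100_000
EVENT_SIZE_BYTES = 1_000             # "1KB" treated as 1,000 bytes
BROKER_CAPACITY = (20_000, 50_000)   # events/sec per node, range from the text
CONSUMER_CAPACITY = (5_000, 10_000)  # events/sec per consumer (assumption)
LINK_BYTES_PER_SEC = 125_000_000     # 1 Gbps expressed in bytes/sec
SECONDS_PER_DAY = 86_400


def instances_needed(events_per_sec: int, per_instance_capacity: int) -> int:
    """Smallest instance count whose combined capacity covers the load."""
    return math.ceil(events_per_sec / per_instance_capacity)


def estimate(events_per_sec: int = EVENTS_PER_SEC,
             event_size_bytes: int = EVENT_SIZE_BYTES) -> dict[str, float]:
    """Back-of-envelope sizing with no headroom or replication factor."""
    bytes_per_sec = events_per_sec * event_size_bytes
    return {
        "brokers_min": instances_needed(events_per_sec, BROKER_CAPACITY[1]),
        "brokers_max": instances_needed(events_per_sec, BROKER_CAPACITY[0]),
        "consumers_min": instances_needed(events_per_sec, CONSUMER_CAPACITY[1]),
        "consumers_max": instances_needed(events_per_sec, CONSUMER_CAPACITY[0]),
        "throughput_mb_per_sec": bytes_per_sec / 1e6,
        "storage_tb_per_day": bytes_per_sec * SECONDS_PER_DAY / 1e12,
        "ingress_share_of_1gbps": bytes_per_sec / LINK_BYTES_PER_SEC,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value}")
```

With the defaults this gives 2-5 brokers, 10-20 consumers, 100 MB/sec, 8.64 TB/day, and 80% of one 1 Gbps link for ingress alone. Real deployments would add headroom for traffic spikes and multiply storage and network figures by the replication factor.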
Structure your scalability discussion in order: first identify the event volume growth, then pinpoint the bottleneck (usually the event broker). Next, explain how to scale horizontally with clusters and partitions, scale consumers, and manage data storage. Mention latency and geo-distribution considerations. Always justify each step by the system limit it addresses.
Your event broker handles 1,000 events per second. Traffic grows 10x to 10,000 events per second. What do you do first?
Answer: Add more broker instances to form a cluster and partition the event streams to distribute load. This prevents the single broker from becoming a bottleneck and allows consumers to scale processing in parallel.