| Users/Events | 100 Users | 10,000 Users | 1,000,000 Users | 100,000,000 Users |
|---|---|---|---|---|
| Event Volume | ~10 events/min | ~1,000 events/min | ~100,000 events/min | ~10,000,000 events/min |
| System Components | Single server, simple alerting | Multiple servers, basic load balancing | Distributed servers, advanced routing | Global distributed system, multi-region failover |
| Database Load | Low, single instance | Moderate, read replicas | High, sharded database | Very high, multi-shard, geo-distributed DB |
| Alerting Latency | Seconds | Low seconds | Sub-second | Milliseconds |
| Storage Needs | GBs | GBs to TBs | Tens of TBs | Petabytes |
| Network Bandwidth | Low | Moderate | High | Very High |
## Emergency handling in LLD - Scalability & System Analysis
The database is the first bottleneck as event volume grows. Emergency handling systems require fast writes and reads for alerts and logs. At around 10,000 users generating thousands of events per minute, a single database instance struggles with write throughput and query latency.
- Horizontal Scaling: Add more application servers behind load balancers to handle increased event processing.
- Database Read Replicas: Use replicas to offload read queries and reduce latency.
- Sharding: Partition the database by event type or region to distribute load.
- Caching: Cache frequent queries and alert statuses in fast in-memory stores like Redis.
- Message Queues: Use queues to buffer incoming events and smooth spikes in traffic.
- CDN and Edge Computing: For alert delivery (e.g., notifications), use CDNs and edge nodes to reduce latency globally.
- Multi-region Deployment: Deploy system components in multiple regions for fault tolerance and disaster recovery.
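The message-queue idea above can be sketched in a few lines. This is a minimal illustration using Python's stdlib `queue` as a stand-in for a real broker such as Kafka or RabbitMQ; `written_batches`, `ingest`, and `drain` are hypothetical names, and the dict "events" are placeholder data.

```python
import queue

# Incoming events land in a bounded in-memory buffer; a writer then drains
# them in fixed-size batches, so the database sees a handful of bulk inserts
# instead of a spike of single-row writes.
event_queue = queue.Queue(maxsize=10_000)  # bounded => back-pressure on spikes
written_batches = []                       # stands in for the database

def ingest(event):
    """The API layer calls this once per incoming event."""
    event_queue.put(event)  # blocks if the buffer is full (back-pressure)

def drain(batch_size=100):
    """Drain the buffer, issuing one bulk write per batch_size events."""
    batch = []
    while not event_queue.empty():
        batch.append(event_queue.get())
        if len(batch) == batch_size:
            written_batches.append(batch)  # a bulk INSERT in a real system
            batch = []
    if batch:
        written_batches.append(batch)      # flush the final partial batch

# Simulate a spike of 1,050 events, then drain.
for i in range(1050):
    ingest({"id": i, "type": "alert"})
drain()
print(len(written_batches))                  # 11 bulk writes (10 full + 1 partial)
print(sum(len(b) for b in written_batches))  # 1050 events preserved
```

The bounded queue is the key design choice: when the buffer fills, producers block instead of overwhelming the database, which is exactly the spike-smoothing behavior a broker provides.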
- At 10,000 users generating ~1,000 events/min (~17 events/sec), the system needs to handle ~17 writes/sec plus reads.
- Database write capacity: A single PostgreSQL instance can typically sustain on the order of a few thousand QPS on modest hardware, so ~17 writes/sec is trivially manageable at this stage.
- Storage: Assuming 1 KB per event, 1,000 events/min = ~1 MB/min = ~1.4 GB/day = ~43 GB/month.
- Network bandwidth: 1,000 events/min * 1 KB = ~17 KB/sec, very low at this scale.
- At 1 million users (~100,000 events/min), write load is ~1,666 QPS, requiring sharded DB and caching.
- Bandwidth and storage scale accordingly, requiring distributed storage and efficient data retention policies.
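The estimates above follow from the single 1 KB/event assumption. A small sketch makes the arithmetic reproducible; the `capacity` helper is illustrative, not part of any real system.

```python
# Back-of-envelope capacity math, using the same 1 KB/event assumption
# as the estimates above.
EVENT_SIZE_KB = 1

def capacity(events_per_min):
    """Return (writes/sec, KB/sec bandwidth, GB/day storage growth)."""
    writes_per_sec = events_per_min / 60
    kb_per_sec = events_per_min * EVENT_SIZE_KB / 60
    gb_per_day = events_per_min * EVENT_SIZE_KB * 60 * 24 / 1e6
    return writes_per_sec, kb_per_sec, gb_per_day

# 10,000-user tier: ~1,000 events/min
wps, kbps, gbd = capacity(1_000)
print(round(wps))     # ~17 writes/sec
print(round(kbps))    # ~17 KB/sec
print(round(gbd, 1))  # ~1.4 GB/day

# 1,000,000-user tier: ~100,000 events/min
wps, _, gbd = capacity(100_000)
print(round(wps))     # ~1667 writes/sec -- beyond a comfortable single-instance write load
print(round(gbd))     # ~144 GB/day
```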
Start by clarifying the expected event volume and latency requirements. Discuss the data flow from event ingestion to alerting. Identify the database as the likely bottleneck early. Propose incremental scaling steps: caching, read replicas, sharding, and multi-region deployment. Emphasize fault tolerance and disaster recovery in emergency systems.
Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?
Answer: Add read replicas to offload read queries and implement caching to reduce database load. If writes are the bottleneck, consider sharding the database to distribute write load across multiple instances.
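The caching half of that answer usually means a cache-aside (lazy-loading) pattern: reads check a fast in-memory store before hitting the database. A minimal sketch, using plain dicts to stand in for Redis and the primary DB; `get_alert`, the key format, and the TTL value are all illustrative assumptions.

```python
import time

db = {"alert:42": {"status": "firing"}}  # stands in for the primary DB / replica
cache = {}                               # stands in for Redis
TTL_SECONDS = 30

def get_alert(key):
    """Cache-aside read: serve from cache if fresh, else read DB and cache."""
    entry = cache.get(key)
    if entry and time.time() - entry["at"] < TTL_SECONDS:
        return entry["value"]                        # cache hit: no DB read
    value = db.get(key)                              # cache miss: hit the DB
    cache[key] = {"value": value, "at": time.time()} # populate with a TTL
    return value

print(get_alert("alert:42"))  # first call: miss, reads DB, populates cache
print(get_alert("alert:42"))  # second call: served from cache
```

The TTL bounds staleness: a short TTL keeps alert statuses fresh while still absorbing the bulk of repeated reads, which is what makes read replicas plus caching the right first move before reaching for sharding.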
