| Users / Messages | 100 users | 10K users | 1M users | 100M users |
|---|---|---|---|---|
| Message volume per second | ~100 msg/s | ~10K msg/s | ~1M msg/s | ~100M msg/s |
| System components | Single broker, single DB | Multiple brokers, DB replicas | Partitioned brokers, sharded DB | Global distributed brokers, multi-region DB shards |
| Delivery guarantee complexity | Simple ack, retry | Idempotency, deduplication | Exactly-once semantics, ordering | Cross-region consistency, geo-replication |
| Latency | Low, <100ms | Moderate, 100-200ms | Higher, 200-500ms | Variable, 500ms+ |
| Failure handling | Basic retry | Dead-letter queues, monitoring | Advanced retry policies, transactional logs | Multi-region failover, disaster recovery |
Message delivery guarantees in HLD - Scalability & System Analysis
The first bottleneck is the message broker's throughput and storage capacity. At low scale, a single broker handles message delivery with simple acknowledgments. As users grow, the broker's ability to process and persist messages reliably becomes limited. The database or storage system backing the broker also becomes a bottleneck due to write throughput and consistency requirements for delivery guarantees.
- Horizontal scaling: Add more message brokers and partition topics to distribute load.
- Replication: Use broker clusters with replication for fault tolerance and availability.
- Caching: Use in-memory caches for quick message state checks to reduce DB load.
- Sharding: Partition databases by user or topic to scale storage and throughput.
- Idempotency and deduplication: Implement to ensure exactly-once delivery despite retries.
- Dead-letter queues: Handle undeliverable messages separately to avoid blocking.
- Geo-distribution: Deploy brokers and storage in multiple regions for latency and disaster recovery.
- At 10K messages/sec, a single broker can handle ~5K-10K msg/sec, so 2 brokers needed.
- Database write throughput must support message persistence; ~10K writes/sec at 10K msg/sec.
- Storage grows with message retention; 10K msg/sec * 86,400 sec/day = ~864M messages/day.
- Network bandwidth: 10K msg/sec * average message size (e.g., 1KB) = ~10MB/s (~80Mbps).
- At 1M msg/sec, partitioning and sharding are mandatory; storage and network scale accordingly.
Start by clarifying the delivery guarantees needed: at-most-once, at-least-once, or exactly-once. Discuss how these affect system design and complexity. Then, outline scaling steps from single broker to distributed clusters, focusing on bottlenecks and solutions. Use real numbers to justify design choices and show understanding of trade-offs.
Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?
Answer: Add read replicas and implement caching to reduce load on the primary database. Also, consider sharding the database to distribute writes and scale horizontally.
