| Users / Messages | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| 100 users / 1K msg/sec | Works well; single broker or small cluster | Works well; single node or small cluster | Works well; fully managed, no setup needed |
| 10K users / 100K msg/sec | Needs multi-broker cluster, partitioning, replication | Needs clustering, federation; may face throughput limits | Handles scale easily; pay per request; latency may increase |
| 1M users / 1M+ msg/sec | Large cluster with many partitions; careful tuning needed | Scaling harder; may require sharding or multiple clusters | Scales automatically; cost and latency considerations rise |
| 100M users / 10M+ msg/sec | Very large cluster; complex management; high ops effort | Not ideal; likely multiple RabbitMQ clusters or redesign | Still scales; cost and throttling become major factors |
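The rows above can be condensed into a small lookup helper. This is only a restatement of the table as code; the thresholds and phrasing are illustrative, taken directly from the rows, not a real recommendation engine:

```python
def scaling_notes(msgs_per_sec: int) -> dict:
    """Illustrative lookup of the comparison table: rough scaling posture
    for each system at a sustained message rate (order-of-magnitude only)."""
    if msgs_per_sec <= 1_000:
        return {"kafka": "single broker or small cluster",
                "rabbitmq": "single node or small cluster",
                "sqs": "fully managed, no setup needed"}
    if msgs_per_sec <= 100_000:
        return {"kafka": "multi-broker cluster with partitioning and replication",
                "rabbitmq": "clustering/federation; watch throughput limits",
                "sqs": "handles scale; pay per request, latency may rise"}
    if msgs_per_sec <= 1_000_000:
        return {"kafka": "large cluster, many partitions, careful tuning",
                "rabbitmq": "sharding or multiple clusters",
                "sqs": "scales automatically; cost and latency rise"}
    return {"kafka": "very large cluster, high ops effort",
            "rabbitmq": "not ideal; multiple clusters or redesign",
            "sqs": "still scales; cost and throttling dominate"}

print(scaling_notes(100_000)["kafka"])
```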
## Kafka vs RabbitMQ vs SQS in HLD: Scaling Approaches Compared
Where the bottleneck hits first:
- Kafka: Broker disk I/O and network bandwidth limit throughput first, because Kafka persists messages to disk and replicates them across brokers.
- RabbitMQ: Broker CPU and memory become bottlenecks early, driven by message routing and in-memory queues.
- SQS: Latency and per-request cost become the constraint at very high scale, since it is a managed service with API throttling and pay-per-request pricing.
How each one scales:
- Kafka: Add brokers, increase partitions for parallelism, use replication for fault tolerance, and optimize disk and network.
- RabbitMQ: Use clustering and federation to distribute load, shard queues, optimize routing, and offload consumers.
- SQS: Use multiple queues to distribute load, batch requests to reduce cost, and leverage AWS autoscaling for consumers.
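To make the Kafka point concrete: the partition count is the parallelism knob, because each message key hashes to a partition and each partition is consumed independently. A minimal sketch of key-to-partition assignment (Kafka's default partitioner actually uses murmur2; md5 here is a stand-in, and all names are illustrative):

```python
import hashlib

def assign_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition, Kafka-style: hash the key,
    then take it modulo the partition count. Same key -> same partition,
    which preserves per-key ordering."""
    digest = hashlib.md5(key.encode()).digest()  # stand-in for murmur2
    return int.from_bytes(digest[:4], "big") % num_partitions

# More partitions spread the same key space over more consumers.
# Caveat: resizing remaps existing keys to different partitions.
keys = [f"user-{i}" for i in range(1000)]
for n in (4, 8):
    counts = [0] * n
    for k in keys:
        counts[assign_partition(k, n)] += 1
    print(n, counts)
```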
Rough capacity numbers (order-of-magnitude, workload-dependent):
- Kafka: 1 broker handles ~1,000-5,000 concurrent connections; 1 partition sustains on the order of 100K small msg/sec; storage depends on retention (e.g., ~1 TB of log per broker); network ~1 Gbps per broker.
- RabbitMQ: Single node handles ~10K msg/sec; clustering needed beyond that; memory usage grows with queue size; network bandwidth ~100-500 Mbps per node.
- SQS: Handles millions of requests per second across queues; standard-queue pricing is about $0.40 per million requests (after the free tier); storage is managed; bandwidth depends on message size and request volume.
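These figures support quick back-of-envelope sizing in an interview. A sketch using the ~100K msg/sec per-partition number above; the per-broker partition budget and replication factor are illustrative assumptions, not Kafka defaults you should quote:

```python
def partitions_needed(target_msgs_per_sec: int,
                      per_partition_msgs_per_sec: int = 100_000) -> int:
    """Ceiling division: how many partitions cover the target rate."""
    return -(-target_msgs_per_sec // per_partition_msgs_per_sec)

def brokers_needed(partitions: int,
                   replication_factor: int = 3,
                   partitions_per_broker: int = 100) -> int:
    """Spread partition replicas across brokers; you always need at least
    `replication_factor` brokers to place each replica on a distinct node."""
    total_replicas = partitions * replication_factor
    return max(replication_factor, -(-total_replicas // partitions_per_broker))

p = partitions_needed(1_000_000)  # 1M msg/sec target
print(p, brokers_needed(p))
```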
Interview approach: start by clarifying workload size and message patterns; identify the bottleneck at each scale; discuss trade-offs (operational complexity for Kafka, ease of use for SQS, flexibility for RabbitMQ); propose scaling steps that match the bottleneck; and call out cost and latency impacts.
Follow-up question: your database handles 1,000 QPS and traffic grows 10x. What do you do first?
Answer: Add read replicas or caching to take load off the primary database before resorting to vertical scaling or sharding.