# Event Store Concept in Microservices: Scalability & System Analysis

| Users / Events | 100 Users | 10K Users | 1M Users | 100M Users |
|---|---|---|---|---|
| Event Volume per Day | ~10K events | ~1M events | ~100M events | ~10B events |
| Event Store Size | ~100 MB | ~10 GB | ~1 TB | ~100 TB+ |
| Sustained Write Throughput | ~0.1 QPS | ~12 QPS | ~1.2K QPS | ~116K QPS (distributed) |
| Sustained Read Throughput | ~0.1 QPS | ~12 QPS | ~1.2K QPS | ~116K QPS (distributed) |
| Latency | Low (ms) | Low (ms) | Moderate (ms to tens of ms) | Higher (tens of ms) |
| Infrastructure | Single server or small cluster | Cluster with replication | Sharded clusters, partitioned storage | Global distributed clusters, multi-region |
The first bottleneck is the event store database's write throughput. As event volume grows, a single database node struggles to sustain the high rate of writes while keeping latency low: event stores append many small records, which can saturate disk I/O and CPU on a single node. Mitigations include:
- Horizontal scaling: Add more event store nodes and partition events by aggregate or stream ID (sharding) to distribute write load.
- Write batching: Group multiple events into batches to reduce I/O overhead.
- Caching: Use in-memory caches for recent events or snapshots to speed up reads.
- Event snapshots: Periodically store snapshots of aggregate state to reduce replay time.
- Replication: Use read replicas to scale read throughput and improve availability.
- Storage tiering: Archive older events to cheaper, slower storage to keep hot storage performant.
- Specialized event store databases: engines optimized for append-only workloads (e.g., Apache Kafka, EventStoreDB) handle this write pattern far more efficiently than general-purpose databases.
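To make the write-batching idea above concrete, here is a minimal sketch in Python. The `persist_batch` callable, batch size, and time limit are illustrative assumptions, not a specific event store API:

```python
# Minimal write-batching sketch: buffer events in memory and flush them
# in one bulk append when the batch is full or a time limit passes.
# `persist_batch` is a hypothetical stand-in for the store's bulk-append call.
import time

class BatchingWriter:
    def __init__(self, persist_batch, max_batch=100, max_wait_s=0.05):
        self.persist_batch = persist_batch  # callable taking a list of events
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def write(self, event):
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.max_wait_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.persist_batch(self.buffer)  # one I/O call for many events
            self.buffer = []
        self.last_flush = time.monotonic()

# Usage: 250 events become 3 bulk appends instead of 250 single writes.
batches = []
writer = BatchingWriter(batches.append, max_batch=100, max_wait_s=10)
for i in range(250):
    writer.write({"seq": i})
writer.flush()
print([len(b) for b in batches])  # → [100, 100, 50]
```

The trade-off is a bounded extra latency (`max_wait_s`) in exchange for far fewer I/O operations per event.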
- At 10K users generating 1M events/day, expect ~12 QPS sustained writes (1M / 86400 seconds).
- At 1M users generating 100M events/day, expect ~1,157 QPS sustained writes.
- Storage grows by roughly 1 KB per event, so 100M events/day adds ~100 GB of storage per day.
- Network bandwidth must support event replication and client reads; 1 Gbps network can handle ~125 MB/s, enough for ~125K events/s at 1 KB each.
- CPU and disk I/O must be provisioned to handle peak bursts, not just average QPS.
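The back-of-envelope estimates above can be checked with a few lines of arithmetic (assuming ~1 KB per event and using 1 KB = 1000 B for rough math):

```python
# Sanity-check the sustained-rate and storage estimates above.
SECONDS_PER_DAY = 86_400
EVENT_SIZE_BYTES = 1_000  # assumed ~1 KB per event (decimal, for rough math)

def sustained_qps(events_per_day):
    # Average write rate; provision for peak bursts well above this.
    return events_per_day / SECONDS_PER_DAY

print(round(sustained_qps(1_000_000)))    # ~12 QPS at 1M events/day
print(round(sustained_qps(100_000_000)))  # ~1157 QPS at 100M events/day

# Daily storage growth: 100M events at ~1 KB each.
daily_gb = 100_000_000 * EVENT_SIZE_BYTES / 1e9
print(daily_gb)  # ~100 GB/day

# Network ceiling: 1 Gbps ≈ 125 MB/s of replication/read traffic.
events_per_sec_on_1gbps = 125e6 / EVENT_SIZE_BYTES
print(events_per_sec_on_1gbps)  # ~125K events/s at 1 KB each
```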
Start by explaining the event store's role as an append-only log of events. Discuss how writes dominate the workload and how latency matters. Then, identify the database write throughput as the first bottleneck. Propose sharding and replication as solutions. Mention caching and snapshots to optimize reads. Finally, consider storage growth and archival strategies. Keep your explanation clear and structured.
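The snapshot optimization mentioned above can be sketched as follows; the event shape, the fold function, and the snapshot cadence are illustrative assumptions:

```python
# Snapshot sketch: instead of replaying every event to rebuild aggregate
# state, start from the latest snapshot and replay only later events.

def apply(state, event):
    # Fold one event into the running state (here: a simple counter).
    return state + event["delta"]

def rebuild(events, snapshot=None):
    """Rebuild state from a (state, version) snapshot plus newer events."""
    state, version = snapshot if snapshot else (0, -1)
    for i, event in enumerate(events):
        if i > version:  # skip events already baked into the snapshot
            state = apply(state, event)
    return state

events = [{"delta": d} for d in (5, -2, 7, 1)]
full_replay = rebuild(events)            # replays all 4 events
snapshot = (rebuild(events[:3]), 2)      # state as of event index 2
fast_replay = rebuild(events, snapshot)  # replays only event index 3
print(full_replay, fast_replay)  # → 11 11
```

Both paths yield the same state, but the snapshot path replays far fewer events, which matters once streams hold thousands of events.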
Question: Your event store database handles 1,000 QPS of writes. Traffic grows 10x to 10,000 QPS. What do you do first, and why?
Answer: The first step is to shard the event store by partitioning events across multiple nodes by aggregate or stream ID. This distributes the write load so that no single node is overwhelmed, letting the system absorb 10x more writes without latency spikes while preserving per-stream ordering.
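A minimal sketch of that sharding step, assuming hash-based routing by stream ID and an illustrative 10-node cluster (e.g., 10,000 QPS spread at roughly 1,000 QPS per node):

```python
# Hash-based shard routing sketch: events for the same stream always land
# on the same node (preserving per-stream order), while overall write load
# spreads across all nodes. The 10-shard count is illustrative.
import hashlib

NUM_SHARDS = 10

def shard_for(stream_id):
    # Stable hash so the same stream always maps to the same shard.
    digest = hashlib.sha256(stream_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All events of one stream stay ordered on one shard...
assert shard_for("order-42") == shard_for("order-42")

# ...while many streams spread across the cluster.
shards = {shard_for(f"order-{i}") for i in range(1000)}
print(len(shards))  # number of distinct shards receiving writes
```

A production system would typically use consistent hashing so that adding nodes later moves only a fraction of the streams.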