
Event store concept in Microservices - Scalability & System Analysis

Scalability Analysis - Event store concept
Growth Table: Event Store Scaling
| Scale | 100 Users | 10K Users | 1M Users | 100M Users |
|---|---|---|---|---|
| Event volume per day | ~10K events | ~1M events | ~100M events | ~10B events |
| Event store size | ~100 MB | ~10 GB | ~1 TB | ~100 TB+ |
| Sustained write throughput | ~0.1 QPS | ~12 QPS | ~1.2K QPS | ~120K QPS (distributed) |
| Sustained read throughput | ~0.1 QPS | ~12 QPS | ~1.2K QPS | ~120K QPS (distributed) |
| Latency | Low (ms) | Low (ms) | Moderate (ms to tens of ms) | Higher (tens of ms) |
| Infrastructure | Single server or small cluster | Cluster with replication | Sharded clusters, partitioned storage | Global multi-region clusters |
First Bottleneck

The first bottleneck is the event store database's write throughput. As event volume grows, the database struggles to sustain the high rate of writes while maintaining low latency. This is because an event store appends many small writes, which can saturate disk I/O and CPU on a single node.
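The append-only access pattern described above can be made concrete with a minimal sketch. This toy in-memory store (the class and event fields are illustrative, not a real library API) shows why every command produces a small append and why reads replay a stream:

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Event:
    stream_id: str   # the aggregate/stream this event belongs to
    type: str
    payload: dict
    sequence: int = 0

class InMemoryEventStore:
    """Toy append-only event store. Every write is an append (no
    updates, no deletes), which is why many small writes saturate
    disk I/O and CPU on a single node at scale."""

    def __init__(self):
        self._log: list[Event] = []
        self._seq = count(1)  # global, monotonically increasing sequence

    def append(self, event: Event) -> Event:
        event.sequence = next(self._seq)
        self._log.append(event)  # append-only: the log is never rewritten
        return event

    def read_stream(self, stream_id: str) -> list[Event]:
        # Reads replay the log for one aggregate/stream, in order
        return [e for e in self._log if e.stream_id == stream_id]
```

A real event store adds durability, indexing by stream, and optimistic concurrency checks, but the write path is the same shape: one append per event.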

Scaling Solutions
  • Horizontal scaling: Add more event store nodes and partition events by aggregate or stream ID (sharding) to distribute write load.
  • Write batching: Group multiple events into batches to reduce I/O overhead.
  • Caching: Use in-memory caches for recent events or snapshots to speed up reads.
  • Event snapshots: Periodically store snapshots of aggregate state to reduce replay time.
  • Replication: Use read replicas to scale read throughput and improve availability.
  • Storage tiering: Archive older events to cheaper, slower storage to keep hot storage performant.
  • Use specialized event stores: Databases built for append-only workloads (e.g., Apache Kafka, EventStoreDB) outperform general-purpose databases for this access pattern.
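The first solution above, sharding by stream ID, can be sketched as a simple routing function. This is a minimal illustration (the shard count and function name are assumptions, not part of any real API); the key property is that a stable hash keeps all events of one aggregate on the same shard, preserving per-stream ordering while spreading write load:

```python
import hashlib

NUM_SHARDS = 8  # assumption: a fixed shard count, for illustration only

def shard_for(stream_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a stream's events to a shard by hashing the stream ID.

    Using a stable cryptographic hash (rather than Python's built-in
    hash(), which is randomized per process) makes routing consistent
    across nodes and restarts.
    """
    digest = hashlib.sha256(stream_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Note that a fixed modulo means adding shards remaps most streams; production systems typically use consistent hashing or pre-split partitions (as Kafka does) to avoid mass rebalancing.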
Back-of-Envelope Cost Analysis
  • At 10K users generating 1M events/day, expect ~12 QPS sustained writes (1M / 86400 seconds).
  • At 1M users generating 100M events/day, expect ~1,157 QPS sustained writes.
  • Storage grows at roughly 1 KB per event, so 100M events/day adds ~100 GB of storage per day.
  • Network bandwidth must support event replication and client reads; 1 Gbps network can handle ~125 MB/s, enough for ~125K events/s at 1 KB each.
  • CPU and disk I/O must be provisioned to handle peak bursts, not just average QPS.
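The arithmetic above is easy to check in a few lines. This sketch assumes the ~1 KB-per-event figure from the bullets; the function names are illustrative:

```python
SECONDS_PER_DAY = 86_400
EVENT_SIZE_BYTES = 1_024  # assumption: ~1 KB per event, as stated above

def sustained_write_qps(events_per_day: int) -> float:
    """Average sustained write rate; peaks can be 10x or more higher."""
    return events_per_day / SECONDS_PER_DAY

def daily_storage_gb(events_per_day: int) -> float:
    """Raw storage added per day, before replication and indexes."""
    return events_per_day * EVENT_SIZE_BYTES / 1e9

# 1M events/day  -> ~12 QPS sustained
# 100M events/day -> ~1,157 QPS sustained, ~100 GB/day of new storage
```

Remember these are averages: capacity planning should target peak bursts, and replication multiplies the storage and network figures by the replication factor.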
Interview Tip

Start by explaining the event store's role as an append-only log of events. Discuss how writes dominate the workload and how latency matters. Then, identify the database write throughput as the first bottleneck. Propose sharding and replication as solutions. Mention caching and snapshots to optimize reads. Finally, consider storage growth and archival strategies. Keep your explanation clear and structured.

Self Check Question

Your event store database handles 1000 QPS writes. Traffic grows 10x to 10,000 QPS. What do you do first and why?

Answer: The first step is to shard the event store by partitioning events across multiple nodes. This distributes the write load so no single database node is overwhelmed, allowing the system to handle 10x more writes without latency spikes.

Key Result
The event store first breaks at database write throughput as event volume grows. Sharding and replication are key to scaling writes and reads efficiently.