Bird
Raised Fist0
HLDsystem_design~10 mins

Event sourcing in HLD - Scalability & System Analysis

Choose your learning style9 modes available
Scalability Analysis - Event sourcing
Growth Table for Event Sourcing System
Users / Events100 users10K users1M users100M users
Event volume per day~10K events~1M events~100M events~10B events
Event store sizeGBsTBs10s of TBsPetabytes
Read model rebuild timeSecondsMinutesHoursDays
Number of application servers1-210-20100+1000+
Database throughput (QPS)100-5005K-10K50K-100K1M+
Snapshot frequencyEvery 100 eventsEvery 1K eventsEvery 10K eventsEvery 100K events
First Bottleneck

The event store database is the first bottleneck. It must handle a high volume of writes and reads for events. As users and events grow, the database write throughput and storage become limiting factors. Rebuilding read models from large event streams also slows down, impacting query performance.

Scaling Solutions
  • Horizontal scaling: Add more application servers behind load balancers to handle increased event processing and queries.
  • Event store sharding: Partition event data by aggregate or user ID to distribute load across multiple databases.
  • Snapshots: Periodically save aggregate state snapshots to reduce event replay time during read model rebuilds.
  • Caching: Use caches for frequently accessed read models to reduce database load.
  • Read replicas: Use database replicas to scale read queries separately from writes.
  • Archival: Move old events to cheaper storage to reduce primary database size.
  • Asynchronous processing: Use message queues and background workers to decouple event handling and improve throughput.
Back-of-Envelope Cost Analysis

At 1M users generating ~100M events/day:

  • Event write rate: ~1,157 events/sec (100M / 86400 sec)
  • Database QPS needed: ~2,000 (including reads and writes)
  • Storage needed per day: Assuming 1KB per event, ~100GB/day
  • Network bandwidth: ~10 MB/s sustained for event ingestion
  • Snapshot storage: Depends on snapshot frequency, typically smaller than event store
Interview Tip

Start by explaining what event sourcing is and why it helps with auditability and state reconstruction. Then discuss how the event store scales with users and events. Identify the database as the first bottleneck. Propose solutions like sharding, snapshots, and caching. Use numbers to justify your choices. Finally, mention trade-offs like complexity and eventual consistency.

Self Check Question

Your event store database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first and why?

Answer: The first step is to shard the event store to distribute the write load across multiple database instances. This reduces the load on any single database and allows scaling writes horizontally. Additionally, implement snapshots to reduce read load and improve read model rebuild times.

Key Result
Event sourcing systems first hit bottlenecks at the event store database due to high write and read loads. Scaling requires sharding, snapshots, caching, and horizontal scaling of application servers to handle growing event volumes and user counts.