| Metric (by user scale) | 100 users | 10K users | 1M users | 100M users |
|---|---|---|---|---|
| Event volume per day | ~10K events | ~1M events | ~100M events | ~10B events |
| Event store size | GBs | TBs | 10s of TBs | Petabytes |
| Read model rebuild time | Seconds | Minutes | Hours | Days |
| Number of application servers | 1-2 | 10-20 | 100+ | 1000+ |
| Database throughput (QPS) | 100-500 | 5K-10K | 50K-100K | 1M+ |
| Snapshot frequency | Every 100 events | Every 1K events | Every 10K events | Every 100K events |
Event sourcing in HLD - Scalability & System Analysis
The event store database is the first bottleneck: it must absorb an append-heavy write volume while also serving event reads. As users and events grow, write throughput and storage become the limiting factors, and rebuilding read models from ever-longer event streams slows down, degrading query performance.
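To make the write path concrete, here is a minimal in-memory sketch of an append-only event store with optimistic concurrency control. The class and method names (`InMemoryEventStore`, `append`, `read`) are hypothetical, not from any particular event store product; a real implementation would persist to durable storage.

```python
from collections import defaultdict


class ConcurrencyError(Exception):
    """Raised when another writer appended to the stream first."""


class InMemoryEventStore:
    """Append-only event streams keyed by aggregate ID (illustrative sketch)."""

    def __init__(self):
        self._streams = defaultdict(list)

    def append(self, aggregate_id, events, expected_version):
        """Append events only if the stream is still at expected_version.

        This optimistic concurrency check is what makes concurrent writers
        safe: a stale writer fails instead of silently interleaving events.
        """
        stream = self._streams[aggregate_id]
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, found {len(stream)}")
        stream.extend(events)
        return len(stream)  # new stream version

    def read(self, aggregate_id, from_version=0):
        """Return events for one aggregate, starting at from_version."""
        return self._streams[aggregate_id][from_version:]
```

Usage: `store.append("order-1", [{"type": "OrderPlaced"}], expected_version=0)` succeeds only if no events exist yet for `order-1`; a second append with the same `expected_version` raises `ConcurrencyError`.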
- Horizontal scaling: Add more application servers behind load balancers to handle increased event processing and queries.
- Event store sharding: Partition event data by aggregate or user ID to distribute load across multiple databases.
- Snapshots: Periodically save aggregate state snapshots to reduce event replay time during read model rebuilds.
- Caching: Use caches for frequently accessed read models to reduce database load.
- Read replicas: Use database replicas to scale read queries separately from writes.
- Archival: Move old events to cheaper storage to reduce primary database size.
- Asynchronous processing: Use message queues and background workers to decouple event handling and improve throughput.
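The snapshot technique above can be sketched as a fold over the event stream that optionally starts from a saved `(state, version)` pair instead of event zero. The function names and the counter aggregate are illustrative assumptions, not a specific framework's API.

```python
def rebuild_state(events, apply, initial=None, snapshot=None):
    """Rebuild aggregate state by folding events through apply().

    When a snapshot (state, version) is available, replay starts at
    that version, skipping the bulk of the stream.
    """
    if snapshot is not None:
        state, version = snapshot
    else:
        state, version = initial, 0
    for event in events[version:]:
        state = apply(state, event)
    return state


# Example aggregate: a counter where each event adds an amount.
def apply(state, event):
    return (state or 0) + event["amount"]
```

With a snapshot taken every N events, a rebuild replays at most N events regardless of total stream length, which is why snapshot frequency appears as a tunable in the table above.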
At 1M users generating ~100M events/day:
- Event write rate: ~1,157 events/sec (100M / 86400 sec)
- Database QPS needed: ~2,000 (including reads and writes)
- Storage needed per day: Assuming 1KB per event, ~100GB/day
- Network bandwidth: ~1.2 MB/s sustained for event ingestion (1,157 events/sec × 1KB); provision on the order of 10 MB/s to absorb traffic peaks and replication overhead
- Snapshot storage: Depends on snapshot frequency, typically smaller than event store
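The back-of-envelope numbers above can be checked in a few lines. The 1KB average event size is the stated assumption; note that at that size the sustained ingest rate is roughly 1 MiB/s, so headline bandwidth figures are best read as peak provisioning rather than averages.

```python
EVENTS_PER_DAY = 100_000_000   # ~100M events/day at 1M users
SECONDS_PER_DAY = 86_400
EVENT_SIZE_BYTES = 1_024       # assumed average event size (1KB)

# Sustained write rate: ~1,157 events/sec
write_eps = EVENTS_PER_DAY / SECONDS_PER_DAY

# Daily storage growth: ~95 GiB/day
storage_per_day_gb = EVENTS_PER_DAY * EVENT_SIZE_BYTES / 1024**3

# Sustained ingest bandwidth: ~1.1 MiB/s
ingest_mb_s = write_eps * EVENT_SIZE_BYTES / 1024**2

print(f"{write_eps:.0f} events/sec, "
      f"{storage_per_day_gb:.0f} GiB/day, "
      f"{ingest_mb_s:.1f} MiB/s")
```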
How to walk through this in an interview:
- Start by explaining what event sourcing is and why it helps with auditability and state reconstruction.
- Discuss how the event store scales with users and events.
- Identify the database as the first bottleneck.
- Propose solutions like sharding, snapshots, and caching, using numbers to justify your choices.
- Finally, mention trade-offs like added complexity and eventual consistency.
Your event store database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first and why?
Answer: First, shard the event store by aggregate or user ID so writes are distributed across multiple database instances; no single node then has to absorb the full 10,000 QPS, and write capacity scales horizontally. In parallel, add snapshots to shorten aggregate replay and read model rebuild times, and consider read replicas or caching if the growth turns out to be read-heavy.
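Shard routing for the answer above can be sketched as hashing the aggregate ID, so all events for one aggregate land on the same shard and per-aggregate ordering is preserved. The function name and the shard count of 8 are hypothetical.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(aggregate_id: str) -> int:
    """Route an aggregate's events to a shard by hashing its ID.

    Keeping one aggregate on one shard preserves its event ordering
    while spreading write load across shards.
    """
    digest = hashlib.md5(aggregate_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

One design caveat: plain modulo hashing means changing `NUM_SHARDS` remaps most aggregates; consistent hashing or a lookup-based shard map avoids wholesale resharding when capacity grows.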
