| Metric (by user scale) | 100 users | 10K users | 1M users | 100M users |
|---|---|---|---|---|
| Event volume per day | ~10K events | ~1M events | ~100M events | ~10B events |
| Event store size | GBs | TBs | 10s of TBs | Petabytes |
| Read model rebuild time | Seconds | Minutes | Hours | Days |
| Number of application servers | 1-2 | 10-20 | 100+ | 1000+ |
| Database throughput (QPS) | 100-500 | 5K-10K | 50K-100K | 1M+ |
| Snapshot frequency | Every 100 events | Every 1K events | Every 10K events | Every 100K events |
Event sourcing in HLD - Scalability & System Analysis
The event store database is the first bottleneck: it must absorb an append-heavy write volume while also serving event reads. As users and events grow, write throughput and storage become the limiting factors, and rebuilding read models from ever-longer event streams slows down, degrading query performance.
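To make the write path concrete, here is a minimal in-memory sketch of an append-only event store with optimistic concurrency control. The class and method names (`InMemoryEventStore`, `append`, `read`) are hypothetical, not from any particular event store product; a real implementation would persist to durable storage.

```python
from collections import defaultdict


class ConcurrencyError(Exception):
    """Raised when another writer appended to the stream first."""


class InMemoryEventStore:
    """Append-only event streams keyed by aggregate ID (illustrative sketch)."""

    def __init__(self):
        self._streams = defaultdict(list)

    def append(self, aggregate_id, events, expected_version):
        """Append events only if the stream is still at expected_version.

        This optimistic concurrency check is what makes concurrent writers
        safe: a stale writer fails instead of silently interleaving events.
        """
        stream = self._streams[aggregate_id]
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, found {len(stream)}")
        stream.extend(events)
        return len(stream)  # new stream version

    def read(self, aggregate_id, from_version=0):
        """Return events for one aggregate, starting at from_version."""
        return self._streams[aggregate_id][from_version:]
```

Usage: `store.append("order-1", [{"type": "OrderPlaced"}], expected_version=0)` succeeds only if no events exist yet for `order-1`; a second append with the same `expected_version` raises `ConcurrencyError`.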
- Horizontal scaling: Add more application servers behind load balancers to handle increased event processing and queries.
- Event store sharding: Partition event data by aggregate or user ID to distribute load across multiple databases.
- Snapshots: Periodically save aggregate state snapshots to reduce event replay time during read model rebuilds.
- Caching: Use caches for frequently accessed read models to reduce database load.
- Read replicas: Use database replicas to scale read queries separately from writes.
- Archival: Move old events to cheaper storage to reduce primary database size.
- Asynchronous processing: Use message queues and background workers to decouple event handling and improve throughput.
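The snapshot technique above can be sketched as a fold over the event stream that optionally starts from a saved `(state, version)` pair instead of event zero. The function names and the counter aggregate are illustrative assumptions, not a specific framework's API.

```python
def rebuild_state(events, apply, initial=None, snapshot=None):
    """Rebuild aggregate state by folding events through apply().

    When a snapshot (state, version) is available, replay starts at
    that version, skipping the bulk of the stream.
    """
    if snapshot is not None:
        state, version = snapshot
    else:
        state, version = initial, 0
    for event in events[version:]:
        state = apply(state, event)
    return state


# Example aggregate: a counter where each event adds an amount.
def apply(state, event):
    return (state or 0) + event["amount"]
```

With a snapshot taken every N events, a rebuild replays at most N events regardless of total stream length, which is why snapshot frequency appears as a tunable in the table above.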
At 1M users generating ~100M events/day:
- Event write rate: ~1,157 events/sec (100M / 86400 sec)
- Database QPS needed: ~2,000 (including reads and writes)
- Storage needed per day: Assuming 1KB per event, ~100GB/day
- Network bandwidth: ~1.2 MB/s sustained for event ingestion (1,157 events/sec × 1KB); provision on the order of 10 MB/s to absorb traffic peaks and replication overhead
- Snapshot storage: Depends on snapshot frequency, typically smaller than event store
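The back-of-envelope numbers above can be checked in a few lines. The 1KB average event size is the stated assumption; note that at that size the sustained ingest rate is roughly 1 MiB/s, so headline bandwidth figures are best read as peak provisioning rather than averages.

```python
EVENTS_PER_DAY = 100_000_000   # ~100M events/day at 1M users
SECONDS_PER_DAY = 86_400
EVENT_SIZE_BYTES = 1_024       # assumed average event size (1KB)

# Sustained write rate: ~1,157 events/sec
write_eps = EVENTS_PER_DAY / SECONDS_PER_DAY

# Daily storage growth: ~95 GiB/day
storage_per_day_gb = EVENTS_PER_DAY * EVENT_SIZE_BYTES / 1024**3

# Sustained ingest bandwidth: ~1.1 MiB/s
ingest_mb_s = write_eps * EVENT_SIZE_BYTES / 1024**2

print(f"{write_eps:.0f} events/sec, "
      f"{storage_per_day_gb:.0f} GiB/day, "
      f"{ingest_mb_s:.1f} MiB/s")
```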
How to walk through this in an interview:
- Start by explaining what event sourcing is and why it helps with auditability and state reconstruction.
- Discuss how the event store scales with users and events.
- Identify the database as the first bottleneck.
- Propose solutions like sharding, snapshots, and caching, using numbers to justify your choices.
- Finally, mention trade-offs like added complexity and eventual consistency.
Your event store database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first and why?
Answer: First, shard the event store by aggregate or user ID so writes are distributed across multiple database instances; no single node then has to absorb the full 10,000 QPS, and write capacity scales horizontally. In parallel, add snapshots to shorten aggregate replay and read model rebuild times, and consider read replicas or caching if the growth turns out to be read-heavy.
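Shard routing for the answer above can be sketched as hashing the aggregate ID, so all events for one aggregate land on the same shard and per-aggregate ordering is preserved. The function name and the shard count of 8 are hypothetical.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(aggregate_id: str) -> int:
    """Route an aggregate's events to a shard by hashing its ID.

    Keeping one aggregate on one shard preserves its event ordering
    while spreading write load across shards.
    """
    digest = hashlib.md5(aggregate_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

One design caveat: plain modulo hashing means changing `NUM_SHARDS` remaps most aggregates; consistent hashing or a lookup-based shard map avoids wholesale resharding when capacity grows.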
