| Users | Event volume | System changes |
|---|---|---|
| 100 users | ~10K events/day | Single event store instance; simple replay; low latency |
| 10K users | ~1M events/day | Partition event store; add read replicas; batch replay; introduce caching |
| 1M users | ~100M events/day | Sharded event store; distributed replay workers; event compaction; asynchronous replay |
| 100M users | ~10B events/day | Multi-region event stores; advanced partitioning; replay throttling; event archival; CDN for event snapshots |
Event replay in Microservices - Scalability & System Analysis
The event store database is the first bottleneck. As event volume grows, it has to sustain high write throughput for storing new events and high read throughput for replaying old ones at the same time. Under that combined load, latency rises and replays risk losing data.
- Horizontal scaling: Add more event store nodes and partition events by user or event type to distribute load.
- Read replicas: Use replicas to offload replay reads from the primary event store.
- Caching: Cache frequently replayed event sequences to reduce database hits.
- Batch processing: Replay events in batches asynchronously to smooth load.
- Event compaction: Summarize or snapshot event streams to reduce replay size.
- Multi-region deployment: Deploy event stores closer to users to reduce latency.
- Throttling: Limit replay request rates to prevent overload.
- Archival: Move old events to cheaper storage to keep active event store performant.
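Two of the strategies above, batch processing and event compaction, can be sketched together. This is a minimal in-memory illustration, not a production design: the `Event` class, its `seq` and `amount` fields, and the `batches` and `replay` helpers are hypothetical names chosen for the example, and the "state" is just a running balance.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Event:
    seq: int     # position in the stream (hypothetical field)
    amount: int  # change applied to a running balance (hypothetical field)


def batches(events: Iterable[Event], size: int) -> Iterator[list]:
    """Yield events in fixed-size batches so replay load arrives in chunks."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


def replay(events, snapshot_seq=0, snapshot_state=0, batch_size=2):
    """Rebuild state from a snapshot plus only the events recorded after it."""
    state = snapshot_state
    tail = (e for e in events if e.seq > snapshot_seq)
    for batch in batches(tail, batch_size):
        for event in batch:
            state += event.amount
    return state


stream = [Event(1, 10), Event(2, -3), Event(3, 5), Event(4, 7), Event(5, -2)]

full = replay(stream)  # processes all five events
# A snapshot taken after event 3 holds the balance 10 - 3 + 5 = 12,
# so only events 4 and 5 need to be replayed.
from_snapshot = replay(stream, snapshot_seq=3, snapshot_state=12)
print(full, from_snapshot)  # 17 17
```

Both calls reach the same state, but the snapshot-based replay reads two events instead of five. On long streams that difference is what keeps replay size bounded.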
- At 1M users generating 100M events/day (~1,157 events/sec), the event store must handle ~1,200 writes/sec plus replay reads.
- Storage needed: Assuming 1 KB per event, 100M events/day = ~100 GB/day; this requires scalable storage and retention policies.
- Network bandwidth: For replay, streaming event data can consume significant bandwidth; e.g., 1K replays/sec * 1 MB replay size = ~1 GB/s peak.
- Compute: Replay workers must be scaled horizontally to process event streams without delay.
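The estimates above can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) and an even spread of events over the day; the variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

events_per_day = 100_000_000
event_size_bytes = 1_000        # "1 KB per event", taken as 1,000 bytes
replays_per_sec = 1_000
replay_size_bytes = 1_000_000   # "1 MB replay size", taken as 1,000,000 bytes

# Average write rate if events are spread evenly across the day.
writes_per_sec = events_per_day / SECONDS_PER_DAY

# Raw event storage added per day, before replication or compression.
storage_gb_per_day = events_per_day * event_size_bytes / 1e9

# Peak replay bandwidth if every replay streams its full payload in one second.
replay_gb_per_sec = replays_per_sec * replay_size_bytes / 1e9

print(f"{writes_per_sec:.0f} writes/sec")  # 1157 writes/sec
print(f"{storage_gb_per_day:.0f} GB/day")  # 100 GB/day
print(f"{replay_gb_per_sec:.0f} GB/s")     # 1 GB/s
```

Real traffic is rarely uniform, so peak write rates will sit above the 1,157/sec average. That is one reason to round the target up to ~1,200 writes/sec or more.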
Start by explaining the event replay flow and identify the main components. Discuss how event volume affects storage and replay latency. Highlight the event store as the bottleneck and propose scaling strategies like partitioning and caching. Use concrete numbers to justify your choices and mention trade-offs like consistency vs. availability.
Your event store database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?
Answer: Add read replicas and partition the event store to distribute load horizontally. This reduces pressure on a single database instance and maintains replay performance.
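To make the partitioning part of this answer concrete, here is a minimal sketch of routing events to partitions by user ID. The `partition_for` function, the `user-<n>` ID format, and the choice of 10 partitions are assumptions for illustration. A real event store would use its own partitioner and would also need a plan for rebalancing when partitions are added.

```python
import hashlib
from collections import Counter


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user ID to a partition with a stable hash (same ID, same partition)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


NUM_PARTITIONS = 10

# Count how many of 10,000 hypothetical users land on each partition.
load = Counter(partition_for(f"user-{i}", NUM_PARTITIONS) for i in range(10_000))
for partition, users in sorted(load.items()):
    print(partition, users)
```

Because all events for one user hash to the same partition, per-user ordering is preserved within that partition. If the spread is roughly even, 10,000 QPS across 10 partitions leaves each node near the original 1,000 QPS.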
Practice
What is the main purpose of event replay in a microservices architecture?
Solution
Step 1: Understand the event replay concept
Event replay means using stored events to reconstruct the current state of a system by processing them again in the order they occurred.
Step 2: Identify the main purpose
Replay is used to recover system state after failures or to debug by examining past events. It is not a mechanism for notifications, load balancing, or encryption.
Final Answer:
To rebuild system state by reprocessing stored events in order -> Option B
Quick Check:
Event replay = rebuild state [OK]
- Confusing event replay with real-time messaging
- Thinking event replay balances load
- Assuming event replay encrypts data
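The idea of rebuilding state by reprocessing stored events can be sketched as a simple fold over an event log. The event shapes (`created`, `status_changed`, `deleted`) and the `apply` and `rebuild` helpers are hypothetical, chosen only to illustrate the pattern.

```python
from typing import Optional


def apply(state: Optional[dict], event: dict) -> Optional[dict]:
    """Return the new state after applying one event to the previous state."""
    kind = event["type"]
    if kind == "created":
        return {"id": event["id"], "status": "new"}
    if kind == "status_changed" and state is not None:
        return {**state, "status": event["status"]}
    if kind == "deleted":
        return None
    return state  # unknown or inapplicable events leave the state unchanged


def rebuild(events) -> Optional[dict]:
    """Replay events in their stored order, starting from an empty state."""
    state = None
    for event in events:
        state = apply(state, event)
    return state


log = [
    {"type": "created", "id": "order-1"},
    {"type": "status_changed", "status": "paid"},
    {"type": "status_changed", "status": "shipped"},
]
print(rebuild(log))  # {'id': 'order-1', 'status': 'shipped'}
```

No current state is stored anywhere in this sketch. It is derived entirely from the log, so a service that loses its in-memory state can recover by replaying.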
Solution
Step 1: Understand importance of event order
Events must be replayed in the exact order they occurred to correctly rebuild system state.
Step 2: Identify correct ordering method
Using timestamps to sort events chronologically ensures the correct sequence during replay.
Final Answer:
Store events with timestamps and replay by sorting them chronologically -> Option D
Quick Check:
Correct event order = chronological replay [OK]
- Replaying events randomly
- Skipping older events
- Ignoring event order
[(1, 'create'), (3, 'update'), (2, 'update'), (4, 'delete')]
What is the correct order of events during replay?
Solution
Step 1: Sort events by timestamp
Sort the list by the first element (timestamp): 1, 2, 3, 4.
Step 2: Extract event names in sorted order
Events in order: 'create' (1), 'update' (2), 'update' (3), 'delete' (4).
Final Answer:
[('create'), ('update'), ('update'), ('delete')] -> Option C
Quick Check:
Sorted timestamps = 1,2,3,4 [OK]
- Ignoring timestamp order
- Mixing event sequence
- Assuming original list order is correct
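The two solution steps for this question map directly onto a short snippet. It assumes each event is a `(timestamp, name)` tuple, as in the list given in the question.

```python
events = [(1, 'create'), (3, 'update'), (2, 'update'), (4, 'delete')]

# Step 1: sort by the first tuple element (the timestamp).
ordered = sorted(events, key=lambda event: event[0])

# Step 2: extract the event names in sorted order.
names = [name for _, name in ordered]

print(ordered)  # [(1, 'create'), (2, 'update'), (3, 'update'), (4, 'delete')]
print(names)    # ['create', 'update', 'update', 'delete']
```

Python's `sorted` is stable, so events that share a timestamp keep their stored order relative to each other.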
Solution
Step 1: Analyze replay error cause
Incorrect system state after replay usually means the event sequence was not preserved.
Step 2: Identify the most common cause
Replaying events out of order breaks the state reconstruction logic, causing errors.
Final Answer:
Events were replayed out of order -> Option A
Quick Check:
Out-of-order replay = wrong state [OK]
- Blaming encryption which doesn't affect replay order
- Assuming parallel replay is always safe
- Filtering events without understanding impact
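A small experiment shows why out-of-order replay produces the wrong state. The event format `(timestamp, kind, value)` and the `create` / `rename` event kinds are made up for this sketch.

```python
def replay(events):
    """Apply events in the order given and return the resulting state."""
    state = {}
    for _, kind, value in events:
        if kind == "create":
            state["name"] = value
        elif kind == "rename" and "name" in state:
            state["name"] = value  # last rename applied wins
    return state


# Events as they happened to arrive: timestamp 3 was stored before timestamp 2.
arrival_order = [
    (1, "create", "draft"),
    (3, "rename", "final"),
    (2, "rename", "review"),
]

wrong = replay(arrival_order)
right = replay(sorted(arrival_order, key=lambda event: event[0]))

print(wrong)  # {'name': 'review'}
print(right)  # {'name': 'final'}
```

The same three events yield two different end states. The replay logic is identical in both runs; only the order differs.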
Solution
Step 1: Understand impact of replay on live system
Replaying events synchronously during user requests can slow down or disrupt the live system.
Step 2: Choose design for performance and safety
Using a separate copy of the event store and replaying asynchronously isolates analysis from live traffic, preserving performance.
Final Answer:
Replay events asynchronously from a separate event store copy -> Option A
Quick Check:
Async replay on copy = no live impact [OK]
- Replaying synchronously blocking live requests
- Analyzing only latest event missing history
- Ignoring benefits of event replay for analysis
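The design in this answer can be sketched with `asyncio`, using an in-memory list as a stand-in for the event store. Everything here is a simplifying assumption: `live_store`, `replay_from_copy`, and `handle_live_write` are hypothetical names, and a real system would replay from a read replica or exported copy rather than a `deepcopy`.

```python
import asyncio
import copy

# Stand-in for the live event store (an in-memory list for illustration).
live_store = [{"seq": 1, "amount": 10}, {"seq": 2, "amount": 5}]


async def replay_from_copy(snapshot):
    """Analyze a copied event list without touching the live store."""
    total = 0
    for event in snapshot:
        total += event["amount"]
        await asyncio.sleep(0)  # yield control so live work can interleave
    return total


async def handle_live_write(event):
    """Simulate a live request appending a new event."""
    live_store.append(event)
    return len(live_store)


async def main():
    snapshot = copy.deepcopy(live_store)  # separate copy for analysis
    replay_task = asyncio.create_task(replay_from_copy(snapshot))
    # The live write is served while the replay task is still pending.
    live_size = await handle_live_write({"seq": 3, "amount": 100})
    total = await replay_task
    return total, live_size


total, live_size = asyncio.run(main())
print(total, live_size)  # 15 3
```

The replay total of 15 reflects only the two copied events, while the live store has already grown to three. The analysis neither blocks nor observes live writes, which is the isolation the answer describes.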
