
Event replay in Microservices - System Design Exercise

Design: Event Replay System for Microservices
This design focuses on event capture, storage, and replay mechanisms for microservices. The internal business logic of individual microservices and UI design are out of scope.
Functional Requirements
FR1: Capture and store all events generated by microservices in an immutable log
FR2: Allow replaying events from any point in time to rebuild state or recover from failures
FR3: Support replaying events for a single microservice or multiple microservices
FR4: Ensure event ordering is preserved during replay
FR5: Provide APIs to trigger event replay with filters like time range or event type
FR6: Handle high throughput of events (up to 100,000 events per second)
FR7: Ensure minimal impact on live system performance during event capture and replay
Non-Functional Requirements
NFR1: System must handle 100K events per second ingestion
NFR2: Replay latency should be under 5 minutes for up to 1 million events
NFR3: Availability target of 99.9% uptime
NFR4: Event storage must be durable and immutable
NFR5: Replay must guarantee exactly-once processing semantics
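A quick back-of-envelope check helps validate these targets. The replay rate follows directly from NFR1/NFR2; the storage estimate assumes an average event size of 1 KB, which is an assumption, not a stated requirement:

```python
# Back-of-envelope check of the NFR targets (numbers from NFR1/NFR2).
INGEST_RATE = 100_000          # events/s sustained ingestion (NFR1)
REPLAY_EVENTS = 1_000_000      # events to replay (NFR2)
REPLAY_BUDGET_S = 5 * 60       # 5-minute replay latency budget (NFR2)

min_replay_rate = REPLAY_EVENTS / REPLAY_BUDGET_S
print(f"minimum replay rate: {min_replay_rate:,.0f} events/s")  # ~3,333 events/s

# Assuming an average event size of 1 KB (an illustrative assumption),
# daily storage growth at peak ingestion:
avg_event_bytes = 1_024
daily_bytes = INGEST_RATE * avg_event_bytes * 86_400
print(f"daily storage at peak: {daily_bytes / 1e12:.1f} TB/day")
```

So replay throughput is modest compared to ingestion, but storage growth (several TB/day at peak) is what motivates the retention and archiving discussion later.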
Think Before You Design
Questions to Ask
❓ What is the required event retention period, and can older events be archived to cheaper storage?
❓ Must replay preserve global ordering, or only per-service / per-partition ordering?
❓ Should consumers be able to distinguish replayed events from live events?
❓ What are typical event sizes, and what storage growth does 100K events/s imply?
❓ Is exactly-once delivery required end-to-end, or is at-least-once delivery with idempotent consumers acceptable?
❓ Do replays run against live consumers, or into isolated rebuild environments?
❓ How should in-flight replays be paused, resumed, or cancelled?
Key Components
Event producers in microservices
Event log storage (e.g., Kafka, event store)
Event replay service
Event consumers or microservices subscribing to replayed events
API gateway for replay control
Monitoring and alerting system
Design Patterns
Event sourcing
CQRS (Command Query Responsibility Segregation)
Immutable event logs
Idempotent event processing
Backpressure and rate limiting during replay
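Idempotent event processing is what makes replay safe when an event is delivered more than once. A minimal sketch, assuming deduplication by event ID (in production the processed-ID set would live in durable storage, updated atomically with the state change):

```python
# Idempotent consumer sketch: records processed event IDs so a replay
# that redelivers the same event has no additional effect.
class IdempotentConsumer:
    def __init__(self):
        self.processed_ids = set()   # stands in for a durable dedup store
        self.balance = 0             # example piece of rebuilt state

    def handle(self, event):
        if event["event_id"] in self.processed_ids:
            return False             # duplicate from replay: skip
        self.balance += event["amount"]
        self.processed_ids.add(event["event_id"])
        return True

consumer = IdempotentConsumer()
event = {"event_id": "e-1", "amount": 42}
consumer.handle(event)
consumer.handle(event)               # redelivered during replay: ignored
print(consumer.balance)              # 42, not 84
```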
Reference Architecture
 +----------------+       +----------------+       +----------------+
 | Microservices  | ----> | Event Log Store| <---- | Event Replay   |
 | (Event Prod.)  |       |  (Kafka or ES) |       | Service/API    |
 +----------------+       +----------------+       +----------------+
                                  |                        |
                                  v                        v
                          +----------------+       +----------------+
                          | Event Consumers| <---- | Replay Clients |
                          | (Live & Replay)|       +----------------+
                          +----------------+
Components
Microservices (Event Producers)
Any microservice framework
Generate domain events and publish them to the event log
Event Log Store
Apache Kafka or EventStoreDB
Durably store events in order, support high throughput and immutable logs
Event Replay Service/API
Custom service with REST/gRPC API
Provide APIs to trigger event replay with filters and manage replay lifecycle
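A replay-trigger request might look like the following. The field names are illustrative assumptions, not a documented contract:

```python
# Hypothetical shape of a replay-trigger request to the Replay Service API.
# All field names are assumptions chosen to mirror FR3/FR5.
import json

replay_request = {
    "service": "order-service",            # FR3: replay for a single service
    "event_types": ["OrderCreated"],       # FR5: filter by event type
    "start_time": "2024-01-01T00:00:00Z",  # FR5: time-range filter
    "end_time": "2024-01-02T00:00:00Z",
    "max_rate": 5000,                      # cap replay speed to protect consumers
}
print(json.dumps(replay_request, indent=2))
```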
Event Consumers
Microservices or stream processors
Consume live events and replayed events to update state or trigger actions
Replay Clients
Microservices or batch jobs
Subscribe to replayed events to rebuild state or recover from failures
Monitoring and Alerting
Prometheus, Grafana
Track event ingestion, replay progress, failures, and system health
Request Flow
1. Microservices produce events and publish them to the Event Log Store.
2. The Event Log Store appends events in order and stores them durably.
3. Event Consumers subscribe to live events for real-time processing.
4. When replay is needed, a client calls the Event Replay Service API with filters (time range, event type).
5. The Event Replay Service reads events from the Event Log Store starting from the requested offset or timestamp.
6. The Event Replay Service streams events to Replay Clients, preserving order and ensuring exactly-once delivery.
7. Replay Clients process events to rebuild state or recover data.
8. Monitoring tracks event flow and replay status to alert on issues.
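The core of this flow can be sketched with a toy in-memory log standing in for Kafka/EventStoreDB: events are appended with an offset, and replay filters by time range and event type while preserving offset order (FR4, FR5):

```python
# Toy in-memory event log: append-only storage plus filtered, ordered replay.
import time

class EventLog:
    def __init__(self):
        self._events = []                      # immutable, append-only

    def append(self, event_type, payload, ts=None):
        event = {
            "offset": len(self._events),       # ordering key
            "timestamp": ts if ts is not None else time.time(),
            "event_type": event_type,
            "payload": payload,
        }
        self._events.append(event)
        return event["offset"]

    def replay(self, start_ts=None, end_ts=None, event_types=None):
        for e in self._events:                 # offsets ascend: order preserved
            if start_ts is not None and e["timestamp"] < start_ts:
                continue
            if end_ts is not None and e["timestamp"] > end_ts:
                continue
            if event_types and e["event_type"] not in event_types:
                continue
            yield e

log = EventLog()
log.append("OrderCreated", {"id": 1}, ts=100)
log.append("OrderShipped", {"id": 1}, ts=200)
log.append("OrderCreated", {"id": 2}, ts=300)

replayed = list(log.replay(start_ts=150, event_types={"OrderCreated"}))
print([e["payload"]["id"] for e in replayed])   # [2]
```

In a real deployment the equivalent operations are a Kafka producer write, and a consumer that seeks to the offset for a given timestamp before reading forward.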
Database Schema
Entities:
- Event: {event_id (PK), timestamp, event_type, payload (JSON), metadata, offset}
- ReplayJob: {job_id (PK), start_time, end_time, status, filters}
Relationships:
- Events are immutable and stored sequentially, with offset as the ordering key
- ReplayJob tracks replay requests and progress
- No direct relationships between events; ordering is by offset
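As a sketch, the two entities map to relational tables like this (SQLite for illustration only; a production system would more likely keep events in Kafka topics and use a small metadata database just for replay jobs):

```python
# Illustrative relational mapping of the Event and ReplayJob entities.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event (
    event_id     TEXT PRIMARY KEY,
    event_offset INTEGER NOT NULL,   -- ordering is by offset, not by ID
    timestamp    TEXT NOT NULL,
    event_type   TEXT NOT NULL,
    payload      TEXT NOT NULL,      -- JSON
    metadata     TEXT
);
CREATE UNIQUE INDEX idx_event_offset ON event (event_offset);

CREATE TABLE replay_job (
    job_id     TEXT PRIMARY KEY,
    start_time TEXT,
    end_time   TEXT,
    status     TEXT NOT NULL,        -- e.g. PENDING / RUNNING / DONE / FAILED
    filters    TEXT                  -- JSON-encoded filter spec
);
""")
conn.execute("INSERT INTO event VALUES ('e-1', 0, '2024-01-01T00:00:00Z', 'OrderCreated', '{}', NULL)")
row = conn.execute("SELECT event_type FROM event ORDER BY event_offset LIMIT 1").fetchone()
print(row[0])  # OrderCreated
```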
Scaling Discussion
Bottlenecks
Event Log Store throughput limits under peak load
Replay Service processing large volumes of events causing latency
Network bandwidth during large replay streams
Event Consumers overwhelmed by replay event bursts
Storage growth due to long event retention
Solutions
Partition event log by microservice or event type to parallelize ingestion
Use scalable distributed event log systems like Kafka with multiple brokers
Implement pagination and rate limiting in Replay Service to control replay speed
Use backpressure mechanisms and idempotent consumers to handle replay bursts
Implement tiered storage or archiving for older events to control storage costs
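One way the Replay Service could enforce a replay speed cap is a token bucket; this is one plausible mechanism among several (Kafka quotas or consumer-driven backpressure would also work):

```python
# Minimal token-bucket rate limiter for capping replay throughput.
class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, then spend one token if available.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=2)   # 2 events/s, burst of 2
decisions = [bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.2)]
print(decisions)  # [True, True, False, True]
```

Denied events would be delayed rather than dropped, so replay slows down instead of losing data.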
Interview Tips
Time: Spend 10 minutes clarifying requirements and constraints, 20 minutes designing architecture and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Explain importance of immutable event logs for replay
Discuss how ordering and exactly-once semantics are ensured
Highlight how replay APIs provide flexibility for recovery
Describe how system handles high throughput and large replays
Mention monitoring and alerting for operational reliability