Event replay in Microservices - System Design Exercise

Design: Event Replay System for Microservices
The design focuses on event capture, storage, and replay mechanisms for microservices. The internal business logic of the microservices and UI design are out of scope.
Functional Requirements
FR1: Capture and store all events generated by microservices in an immutable log
FR2: Allow replaying events from any point in time to rebuild state or recover from failures
FR3: Support replaying events for a single microservice or multiple microservices
FR4: Ensure event ordering is preserved during replay
FR5: Provide APIs to trigger event replay with filters like time range or event type
FR6: Handle high throughput of events (up to 100,000 events per second)
FR7: Ensure minimal impact on live system performance during event capture and replay
Non-Functional Requirements
NFR1: System must handle 100K events per second ingestion
NFR2: Replay latency should be under 5 minutes for up to 1 million events
NFR3: Availability target of 99.9% uptime
NFR4: Event storage must be durable and immutable
NFR5: Replay must guarantee exactly-once processing semantics
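A quick back-of-envelope check can make these targets more concrete. The sketch below derives the minimum replay throughput implied by NFR2 and a rough daily storage figure from NFR1. The 1 KB average event size is an assumption made purely for illustration; it does not come from the requirements.

```python
# Back-of-envelope sizing from the stated requirements.
INGEST_EVENTS_PER_SEC = 100_000   # NFR1
REPLAY_EVENTS = 1_000_000         # NFR2
REPLAY_BUDGET_SEC = 5 * 60        # NFR2: under 5 minutes
AVG_EVENT_BYTES = 1_024           # assumption: ~1 KB per event


def min_replay_rate(events: int, budget_sec: int) -> float:
    """Minimum sustained events/sec needed to finish within the budget."""
    return events / budget_sec


def daily_storage_bytes(events_per_sec: int, event_bytes: int) -> int:
    """Raw (unreplicated, uncompressed) bytes written per day at a sustained rate."""
    return events_per_sec * event_bytes * 86_400


replay_rate = min_replay_rate(REPLAY_EVENTS, REPLAY_BUDGET_SEC)
storage_tib = daily_storage_bytes(INGEST_EVENTS_PER_SEC, AVG_EVENT_BYTES) / 1024**4

print(f"replay rate >= {replay_rate:.0f} events/s")
print(f"storage at sustained peak ~ {storage_tib:.2f} TiB/day")
```

Under these assumptions the replay target needs only about 3,300 events per second, far below the ingest rate, while storage at sustained peak load grows by roughly 8 TiB per day before replication and compression.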
Think Before You Design
Key Components
Event producers in microservices
Event log storage (e.g., Kafka, event store)
Event replay service
Event consumers or microservices subscribing to replayed events
API gateway for replay control
Monitoring and alerting system
Design Patterns
Event sourcing
CQRS (Command Query Responsibility Segregation)
Immutable event logs
Idempotent event processing
Backpressure and rate limiting during replay
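Idempotent event processing is the pattern that makes replay safe when an event is delivered more than once. A minimal sketch, assuming every event carries a unique ID; the class name `IdempotentCounter` and the in-memory `seen_ids` set are hypothetical and chosen only for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentCounter:
    """Toy consumer that sums amounts and ignores event IDs it has already applied."""
    total: int = 0
    seen_ids: set = field(default_factory=set)

    def handle(self, event_id: str, amount: int) -> bool:
        """Apply the event once; return False if it was a duplicate."""
        if event_id in self.seen_ids:
            return False
        self.total += amount
        self.seen_ids.add(event_id)
        return True


consumer = IdempotentCounter()
deliveries = [("e1", 10), ("e2", 5), ("e1", 10)]  # e1 is delivered twice
applied = [consumer.handle(event_id, amount) for event_id, amount in deliveries]
print(consumer.total, applied)
```

In a real service the set of processed IDs would need to be durable and updated atomically with the state change; the in-memory set here only shows the idea.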
Reference Architecture
 +----------------+       +----------------+       +----------------+
 | Microservices  | ----> | Event Log Store| <---- | Event Replay   |
 | (Event Prod.)  |       |  (Kafka or ES) |       | Service/API    |
 +----------------+       +----------------+       +----------------+
         |                        |                        |
         v                        |                        v
 +----------------+               |               +----------------+
 | Event Consumers| <--------------+-------------- | Replay Clients |
 | (Live & Replay)|                               +----------------+
 +----------------+
Components
Microservices (Event Producers)
  Technology: Any microservice framework
  Responsibility: Generate domain events and publish them to the event log
Event Log Store
  Technology: Apache Kafka or Event Store DB
  Responsibility: Durably store events in order, support high throughput and immutable logs
Event Replay Service/API
  Technology: Custom service with REST/gRPC API
  Responsibility: Provide APIs to trigger event replay with filters and manage replay lifecycle
Event Consumers
  Technology: Microservices or stream processors
  Responsibility: Consume live events and replayed events to update state or trigger actions
Replay Clients
  Technology: Microservices or batch jobs
  Responsibility: Subscribe to replayed events to rebuild state or recover from failures
Monitoring and Alerting
  Technology: Prometheus, Grafana
  Responsibility: Track event ingestion, replay progress, failures, and system health
Request Flow
1. Microservices produce events and publish them to the Event Log Store.
2. The Event Log Store appends events in order and stores them durably.
3. Event Consumers subscribe to live events for real-time processing.
4. When replay is needed, a client calls the Event Replay Service API with filters (time range, event type).
5. The Event Replay Service reads events from the Event Log Store, starting from the requested offset or timestamp.
6. The Event Replay Service streams events to Replay Clients in order. Exactly-once processing is achieved by combining at-least-once delivery with idempotent handling in the clients.
7. Replay Clients process events to rebuild state or recover data.
8. Monitoring tracks event flow and replay status and alerts on issues.
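The append and replay steps of this flow can be sketched with an in-memory log. This is an illustration under stated assumptions, not how Kafka or Event Store DB implement it: `InMemoryEventLog`, its method names, and the sample event types are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Set


@dataclass(frozen=True)
class Event:
    offset: int
    timestamp: float
    event_type: str
    payload: dict


class InMemoryEventLog:
    """Append-only log; offsets are assigned in arrival order."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, timestamp: float, event_type: str, payload: dict) -> Event:
        event = Event(len(self._events), timestamp, event_type, payload)
        self._events.append(event)
        return event

    def replay(
        self,
        start_ts: Optional[float] = None,
        end_ts: Optional[float] = None,
        event_types: Optional[Set[str]] = None,
    ) -> Iterator[Event]:
        """Yield matching events in offset order (filters are inclusive)."""
        for event in self._events:
            if start_ts is not None and event.timestamp < start_ts:
                continue
            if end_ts is not None and event.timestamp > end_ts:
                continue
            if event_types is not None and event.event_type not in event_types:
                continue
            yield event


log = InMemoryEventLog()
log.append(100.0, "OrderCreated", {"id": 1})
log.append(105.0, "OrderPaid", {"id": 1})
log.append(110.0, "OrderCreated", {"id": 2})
log.append(120.0, "OrderShipped", {"id": 1})

replayed = list(log.replay(start_ts=105.0, event_types={"OrderCreated", "OrderPaid"}))
print([e.offset for e in replayed])
```

Because the generator walks the log in offset order and only skips events, the filtered replay preserves the original relative order of the events it returns.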
Database Schema
Entities:
- Event: {event_id (PK), timestamp, event_type, payload (JSON), metadata, offset}
- ReplayJob: {job_id (PK), start_time, end_time, status, filters}
Relationships:
- Events are immutable and stored sequentially with offset for ordering
- ReplayJob tracks replay requests and progress
- No direct relationships between events; ordering is by offset
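One way to make this schema concrete is a small SQLite sketch. SQLite is used here only because it runs anywhere; the exercise itself does not prescribe a database. The column name `log_offset` (instead of `offset`, which is an SQL keyword), the column types, and the two triggers that reject changes to stored events are assumptions added for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event (
    event_id   TEXT PRIMARY KEY,
    log_offset INTEGER NOT NULL UNIQUE,
    timestamp  REAL NOT NULL,
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL,
    metadata   TEXT
);
CREATE TABLE replay_job (
    job_id     TEXT PRIMARY KEY,
    start_time REAL,
    end_time   REAL,
    status     TEXT NOT NULL,
    filters    TEXT
);
CREATE TRIGGER event_no_update BEFORE UPDATE ON event
BEGIN SELECT RAISE(ABORT, 'events are immutable'); END;
CREATE TRIGGER event_no_delete BEFORE DELETE ON event
BEGIN SELECT RAISE(ABORT, 'events are immutable'); END;
""")

# Payloads are stored as JSON text; offsets are assigned by the writer.
insert_sql = "INSERT INTO event VALUES (?, ?, ?, ?, ?, ?)"
conn.execute(insert_sql, ("evt-1", 0, 1700000000.0, "OrderCreated",
                          json.dumps({"order_id": 1}), None))
conn.execute(insert_sql, ("evt-2", 1, 1700000005.0, "OrderPaid",
                          json.dumps({"order_id": 1}), None))

# Any attempt to modify a stored event is rejected by the trigger.
try:
    conn.execute("UPDATE event SET event_type = 'Tampered' WHERE event_id = 'evt-1'")
    update_rejected = False
except sqlite3.DatabaseError:
    update_rejected = True

# Replay order comes from the offset column, not from insertion luck.
ordered_types = [row[0] for row in conn.execute(
    "SELECT event_type FROM event ORDER BY log_offset")]
print(update_rejected, ordered_types)
```

The triggers are one possible way to express immutability at the storage layer; a log-based store such as Kafka gets the same property from its append-only design.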
Scaling Discussion
Bottlenecks
Event Log Store throughput limits under peak load
Replay Service processing large volumes of events causing latency
Network bandwidth during large replay streams
Event Consumers overwhelmed by replay event bursts
Storage growth due to long event retention
Solutions
Partition event log by microservice or event type to parallelize ingestion
Use scalable distributed event log systems like Kafka with multiple brokers
Implement pagination and rate limiting in Replay Service to control replay speed
Use backpressure mechanisms and idempotent consumers to handle replay bursts
Implement tiered storage or archiving for older events to control storage costs
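Two of these solutions, partitioning and rate-limited paginated replay, can be sketched briefly. Everything here is hypothetical illustration: `partition_for` uses CRC32 only because it is stable across processes, and `replay_in_batches` takes an injectable `sleep` so the pacing can be observed without actually waiting.

```python
import time
import zlib
from typing import Callable, Iterable, Iterator, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key (e.g. a service name) to a stable partition index."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def replay_in_batches(
    events: Iterable[dict],
    batch_size: int,
    max_events_per_sec: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[dict]]:
    """Yield events in pages, pausing after each full page to cap the average rate."""
    pause = batch_size / max_events_per_sec
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
            sleep(pause)  # runs when the consumer asks for the next page
    if batch:
        yield batch  # final partial page


events = [{"offset": i} for i in range(10)]
pauses: List[float] = []  # record pauses instead of sleeping
batches = list(replay_in_batches(events, batch_size=4,
                                 max_events_per_sec=2000, sleep=pauses.append))
print([len(b) for b in batches], pauses)
print(partition_for("orders-service", 8))
```

Because the generator only advances when the consumer requests the next page, a slow consumer naturally slows the replay, which is a simple form of backpressure.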
Interview Tips
Time: Spend 10 minutes clarifying requirements and constraints, 20 minutes designing architecture and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Explain importance of immutable event logs for replay
Discuss how ordering and exactly-once semantics are ensured
Highlight how replay APIs provide flexibility for recovery
Describe how system handles high throughput and large replays
Mention monitoring and alerting for operational reliability

Practice

1. What is the main purpose of event replay in a microservices architecture?
easy
A. To balance load between microservices
B. To rebuild system state by reprocessing stored events in order
C. To send real-time notifications to users
D. To encrypt data during transmission

Solution

  1. Step 1: Understand event replay concept

    Event replay means using stored events to reconstruct the current state of a system by processing them again in the order they occurred.
  2. Step 2: Identify the main purpose

    This process helps recover system state after failures or to debug by looking at past events, not for notifications, load balancing, or encryption.
  3. Final Answer:

    To rebuild system state by reprocessing stored events in order -> Option B
  4. Quick Check:

    Event replay = rebuild state [OK]
Hint: Event replay means replaying past events to restore state
Common Mistakes:
  • Confusing event replay with real-time messaging
  • Thinking event replay balances load
  • Assuming event replay encrypts data
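Rebuilding state by replay amounts to folding the stored events, in order, into an initial state. A minimal sketch, assuming a hypothetical account with `Deposited` and `Withdrawn` events:

```python
from functools import reduce


def apply_event(balance: int, event: dict) -> int:
    """Return the new balance after applying one event."""
    if event["type"] == "Deposited":
        return balance + event["amount"]
    if event["type"] == "Withdrawn":
        return balance - event["amount"]
    return balance  # unknown event types leave the state unchanged


stored_events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 5},
]

# Replay: start from the empty state and apply every event in order.
rebuilt_balance = reduce(apply_event, stored_events, 0)
print(rebuilt_balance)  # 75
```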
2. Which of the following is the correct way to ensure events are replayed in the right order?
easy
A. Ignore event order since it doesn't affect state
B. Replay events randomly to speed up processing
C. Replay only the latest event to save resources
D. Store events with timestamps and replay by sorting them chronologically

Solution

  1. Step 1: Understand importance of event order

    Events must be replayed in the exact order they occurred to correctly rebuild system state.
  2. Step 2: Identify correct ordering method

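    Sorting events by timestamp restores the chronological sequence during replay. In a distributed system, a per-partition offset or sequence number is a more reliable ordering key, because clocks on different hosts can drift.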
  3. Final Answer:

    Store events with timestamps and replay by sorting them chronologically -> Option D
  4. Quick Check:

    Correct event order = chronological replay [OK]
Hint: Replay events by timestamp order to keep state consistent
Common Mistakes:
  • Replaying events randomly
  • Skipping older events
  • Ignoring event order
3. Given the following event log stored as tuples (timestamp, event):
[(1, 'create'), (3, 'update'), (2, 'update'), (4, 'delete')]
What is the correct order of events during replay?
medium
A. [('update'), ('create'), ('delete'), ('update')]
B. [('delete'), ('update'), ('create'), ('update')]
C. [('create'), ('update'), ('update'), ('delete')]
D. [('update'), ('delete'), ('create'), ('update')]

Solution

  1. Step 1: Sort events by timestamp

    Sort the list by the first element (timestamp): 1, 2, 3, 4.
  2. Step 2: Extract event names in sorted order

    Events in order: 'create' (1), 'update' (2), 'update' (3), 'delete' (4).
  3. Final Answer:

    [('create'), ('update'), ('update'), ('delete')] -> Option C
  4. Quick Check:

    Sorted timestamps = 1,2,3,4 [OK]
Hint: Sort by timestamp, then list events in that order
Common Mistakes:
  • Ignoring timestamp order
  • Mixing event sequence
  • Assuming original list order is correct
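The same steps can be checked in a few lines of Python, using the event log from the question:

```python
event_log = [(1, 'create'), (3, 'update'), (2, 'update'), (4, 'delete')]

# Sort by the timestamp (first tuple element), then keep only the event names.
replay_order = [name for _, name in sorted(event_log, key=lambda e: e[0])]
print(replay_order)  # ['create', 'update', 'update', 'delete']
```

Python's `sorted` is stable, so events with equal timestamps would keep their original log order.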
4. A microservice tries to replay events but the system state is incorrect after replay. Which issue is most likely causing this?
medium
A. Events were replayed out of order
B. Events were encrypted during replay
C. Events were replayed multiple times in parallel
D. Events were filtered by type before replay

Solution

  1. Step 1: Analyze replay error cause

    Incorrect system state after replay usually means the event sequence was not preserved.
  2. Step 2: Identify the most common cause

    Replaying events out of order breaks the state reconstruction logic, causing errors.
  3. Final Answer:

    Events were replayed out of order -> Option A
  4. Quick Check:

    Out-of-order replay = wrong state [OK]
Hint: Check event order first when state is wrong after replay
Common Mistakes:
  • Blaming encryption which doesn't affect replay order
  • Assuming parallel replay is always safe
  • Filtering events without understanding impact
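A small experiment shows why order matters. The sketch below replays the same three events in two different orders; the event shape `(kind, key, value)` and the `replay` helper are hypothetical.

```python
def replay(events):
    """Apply create/update/delete events to an initially empty key-value state."""
    state = {}
    for kind, key, value in events:
        if kind == "create":
            state[key] = value
        elif kind == "update" and key in state:
            state[key] = value
        elif kind == "delete":
            state.pop(key, None)
    return state


in_order = [("create", "cart", 1), ("update", "cart", 3), ("delete", "cart", None)]
out_of_order = [("delete", "cart", None), ("create", "cart", 1), ("update", "cart", 3)]

print(replay(in_order))      # {}
print(replay(out_of_order))  # {'cart': 3}
```

Replayed in order, the cart is deleted at the end. With the delete moved to the front, the cart survives, so the rebuilt state no longer matches what actually happened.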
5. You want to add a new feature that analyzes historical user actions using event replay. Which design choice best supports this without affecting live system performance?
hard
A. Replay events asynchronously from a separate event store copy
B. Replay events synchronously on the main database during user requests
C. Replay only the latest event repeatedly for analysis
D. Skip event replay and query live data directly

Solution

  1. Step 1: Understand impact of replay on live system

    Replaying events synchronously during user requests can slow down or disrupt the live system.
  2. Step 2: Choose design for performance and safety

    Using a separate copy of the event store and replaying asynchronously isolates analysis from live traffic, preserving performance.
  3. Final Answer:

    Replay events asynchronously from a separate event store copy -> Option A
  4. Quick Check:

    Async replay on copy = no live impact [OK]
Hint: Use async replay on separate store to avoid live system load
Common Mistakes:
  • Replaying synchronously blocking live requests
  • Analyzing only latest event missing history
  • Ignoring benefits of event replay for analysis
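The idea behind option A can be illustrated with a snapshot and a background worker. This is only a toy model: in a real system the "separate copy" would more likely be a read replica, an archived log, or a dedicated consumer of the event store, and the names `live_log` and `analyze` are hypothetical.

```python
import threading
from collections import Counter

live_log = [
    {"user": "ana", "action": "view"},
    {"user": "ana", "action": "buy"},
    {"user": "bo", "action": "view"},
]


def analyze(snapshot, result):
    """Replay the snapshot and count how often each action occurred."""
    result.update(Counter(event["action"] for event in snapshot))


snapshot = list(live_log)  # separate copy; the analysis never touches live_log
result = Counter()
worker = threading.Thread(target=analyze, args=(snapshot, result))
worker.start()

# Live traffic keeps appending while the analysis runs in the background.
live_log.append({"user": "bo", "action": "buy"})

worker.join()
print(dict(result))
```

The analysis sees a consistent view of history as of the snapshot, and the live log is free to keep growing without waiting on the analysis.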