Event replay in Microservices - System Design Exercise

Design: Event Replay System for Microservices

This design covers event capture, storage, and replay mechanisms for microservices. The internal business logic of the microservices and UI design are out of scope.

Functional Requirements

FR1: Capture and store all events generated by microservices in an immutable log

FR2: Allow replaying events from any point in time to rebuild state or recover from failures

FR3: Support replaying events for a single microservice or multiple microservices

FR4: Ensure event ordering is preserved during replay

FR5: Provide APIs to trigger event replay with filters like time range or event type

FR6: Handle high throughput of events (up to 100,000 events per second)

FR7: Ensure minimal impact on live system performance during event capture and replay

Non-Functional Requirements

NFR1: System must handle 100K events per second ingestion

NFR2: Replay latency should be under 5 minutes for up to 1 million events

NFR3: Availability target of 99.9% uptime

NFR4: Event storage must be durable and immutable

NFR5: Replay must guarantee exactly-once processing semantics (in practice, usually at-least-once delivery combined with idempotent event processing)
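The targets above translate into a few rough capacity numbers. The following is a back-of-the-envelope sketch only: the average event size (`AVG_EVENT_BYTES`) is an assumption, since the requirements do not state one, and the function names are illustrative.

```python
# Back-of-the-envelope numbers derived from the stated targets.
INGEST_EVENTS_PER_SEC = 100_000   # NFR1
REPLAY_EVENTS = 1_000_000         # NFR2
REPLAY_BUDGET_SEC = 5 * 60        # NFR2
AVAILABILITY = 0.999              # NFR3
AVG_EVENT_BYTES = 1024            # ASSUMPTION: not given in the requirements


def min_replay_rate(events: int, budget_sec: int) -> float:
    """Minimum sustained replay rate (events/s) needed to finish within the budget."""
    return events / budget_sec


def ingest_bytes_per_day(events_per_sec: int, event_bytes: int) -> int:
    """Raw bytes appended to the log per day, before replication or compression."""
    return events_per_sec * event_bytes * 86_400


def allowed_downtime_hours_per_year(availability: float) -> float:
    """Downtime budget implied by an availability target."""
    return (1 - availability) * 365 * 24


if __name__ == "__main__":
    print(f"replay rate  >= {min_replay_rate(REPLAY_EVENTS, REPLAY_BUDGET_SEC):.0f} events/s")
    print(f"ingest/day   ~= {ingest_bytes_per_day(INGEST_EVENTS_PER_SEC, AVG_EVENT_BYTES) / 1e12:.2f} TB")
    print(f"downtime/yr  <= {allowed_downtime_hours_per_year(AVAILABILITY):.2f} h")
```

Under these assumptions the replay target needs only a few thousand events per second, far below the ingestion rate, so the harder constraint is keeping replay from interfering with live traffic (FR7).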

Key Components

Event producers in microservices

Event log storage (e.g., Kafka, event store)

Event replay service

Event consumers or microservices subscribing to replayed events

API gateway for replay control

Monitoring and alerting system

Design Patterns

Event sourcing

CQRS (Command Query Responsibility Segregation)

Immutable event logs

Idempotent event processing

Backpressure and rate limiting during replay
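Idempotent event processing is what lets a replay deliver the same event more than once without corrupting state. Below is a minimal sketch of the idea, assuming each event carries a unique `event_id`. The class name, the account-balance state, and the `deposited`/`withdrawn` event types are hypothetical, and the in-memory set stands in for a store that a real consumer would update atomically with its state.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentConsumer:
    """Applies each event at most once, keyed by event_id (illustrative only)."""
    balance: int = 0
    processed_ids: set = field(default_factory=set)

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self.processed_ids:
            return False  # already applied, e.g. redelivered during a replay
        if event["event_type"] == "deposited":      # hypothetical event type
            self.balance += event["amount"]
        elif event["event_type"] == "withdrawn":    # hypothetical event type
            self.balance -= event["amount"]
        self.processed_ids.add(event_id)
        return True


if __name__ == "__main__":
    events = [
        {"event_id": "e1", "event_type": "deposited", "amount": 100},
        {"event_id": "e2", "event_type": "withdrawn", "amount": 30},
    ]
    consumer = IdempotentConsumer()
    for event in events + events:  # second pass simulates a replay of the same events
        consumer.handle(event)
    print(consumer.balance)
```

Folding the events into `balance` is also a small instance of event sourcing: the state is derived entirely from the ordered event stream.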

Reference Architecture

 +----------------+       +----------------+       +----------------+
 | Microservices  | ----> | Event Log Store| <---- | Event Replay   |
 | (Event Prod.)  |       |  (Kafka or ES) |       | Service/API    |
 +----------------+       +----------------+       +----------------+
         |                        |                        |
         v                        |                        v
 +----------------+               |                +----------------+
 | Event Consumers| <-------------+--------------- | Replay Clients |
 | (Live & Replay)|                                +----------------+
 +----------------+

Components

Microservices (Event Producers)
- Technology: any microservice framework
- Responsibility: generate domain events and publish them to the event log

Event Log Store
- Technology: Apache Kafka or Event Store DB
- Responsibility: durably store events in order, support high throughput and immutable logs

Event Replay Service/API
- Technology: custom service with REST/gRPC API
- Responsibility: provide APIs to trigger event replay with filters and manage replay lifecycle

Event Consumers
- Technology: microservices or stream processors
- Responsibility: consume live events and replayed events to update state or trigger actions

Replay Clients
- Technology: microservices or batch jobs
- Responsibility: subscribe to replayed events to rebuild state or recover from failures

Monitoring and Alerting
- Technology: Prometheus, Grafana
- Responsibility: track event ingestion, replay progress, failures, and system health

Request Flow

1. Microservices produce events and publish them to the Event Log Store.

2. Event Log Store appends events in order and stores them durably.

3. Event Consumers subscribe to live events for real-time processing.

4. When replay is needed, a client calls the Event Replay Service API with filters (time range, event type).

5. Event Replay Service reads events from the Event Log Store starting from requested offset or timestamp.

6. Event Replay Service streams events to Replay Clients preserving order and ensuring exactly-once delivery.

7. Replay Clients process events to rebuild state or recover data.

8. Monitoring tracks event flow and replay status to alert on issues.
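Steps 4 to 6 of the flow can be sketched as a filter over an ordered log. This is a minimal illustration, not a Kafka or Event Store client: an in-memory list stands in for the Event Log Store, and the names `LoggedEvent` and `replay_events` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Set


@dataclass(frozen=True)
class LoggedEvent:
    offset: int        # position in the log; defines replay order
    timestamp: float   # seconds since epoch
    event_type: str
    payload: dict


def replay_events(
    log: Iterable[LoggedEvent],
    start_ts: float,
    end_ts: float,
    event_types: Optional[Set[str]] = None,
) -> Iterator[LoggedEvent]:
    """Yield events with start_ts <= timestamp < end_ts in offset order.

    If event_types is given, only events of those types are yielded.
    """
    for event in sorted(log, key=lambda e: e.offset):
        if not (start_ts <= event.timestamp < end_ts):
            continue
        if event_types is not None and event.event_type not in event_types:
            continue
        yield event


if __name__ == "__main__":
    log = [
        LoggedEvent(1, 20.0, "order_created", {"id": "b"}),
        LoggedEvent(0, 10.0, "order_created", {"id": "a"}),
        LoggedEvent(2, 30.0, "order_shipped", {"id": "a"}),
    ]
    for event in replay_events(log, 0.0, 100.0, {"order_created"}):
        print(event.offset, event.event_type)
```

Ordering by offset rather than by timestamp is a deliberate choice in this sketch: timestamps from different producers can collide or drift, while the log offset is assigned once at append time.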

Database Schema

Entities:
- Event: {event_id (PK), timestamp, event_type, payload (JSON), metadata, offset}
- ReplayJob: {job_id (PK), start_time, end_time, status, filters}

Relationships:
- Events are immutable and stored sequentially with offset for ordering
- ReplayJob tracks replay requests and progress
- No direct relationships between events; ordering is by offset
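One way to express these two entities in code is as plain data classes. This is a sketch under stated assumptions: the field types and the `ReplayStatus` values are not given by the schema, and `start_time`/`end_time` are read here as the replay time window.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class ReplayStatus(Enum):
    # ASSUMPTION: the schema names a status field but not its values.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Event:
    event_id: str        # primary key
    timestamp: datetime
    event_type: str
    payload: dict        # JSON document
    metadata: dict
    offset: int          # sequential position; defines ordering


@dataclass
class ReplayJob:
    job_id: str          # primary key
    start_time: datetime  # ASSUMPTION: start of the replay window
    end_time: datetime    # ASSUMPTION: end of the replay window
    filters: dict = field(default_factory=dict)
    status: ReplayStatus = ReplayStatus.PENDING


if __name__ == "__main__":
    event = Event("e1", datetime(2024, 1, 1), "order_created", {"id": "a"}, {}, 0)
    job = ReplayJob("j1", datetime(2024, 1, 1), datetime(2024, 1, 2),
                    filters={"event_type": "order_created"})
    job.status = ReplayStatus.RUNNING  # jobs are mutable; events are not
    print(event.offset, job.status.value)
```

Note that `frozen=True` only blocks attribute reassignment; the `payload` and `metadata` dictionaries themselves remain mutable, so real immutability has to come from the log store.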

Scaling Discussion

Bottlenecks

Event Log Store throughput limits under peak load

Replay Service processing large volumes of events causing latency

Network bandwidth during large replay streams

Event Consumers overwhelmed by replay event bursts

Storage growth due to long event retention

Solutions

Partition event log by microservice or event type to parallelize ingestion

Use scalable distributed event log systems like Kafka with multiple brokers

Implement pagination and rate limiting in Replay Service to control replay speed

Use backpressure mechanisms and idempotent consumers to handle replay bursts

Implement tiered storage or archiving for older events to control storage costs
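Two of the solutions above, partitioning and paginated, rate-limited replay, can be sketched in a few lines. This is an illustration under assumptions: `partition_for` and `throttled_batches` are hypothetical helper names, the partition key is taken to be a service name or event type, and the pacing is a simple sleep after each full batch rather than a production-grade limiter.

```python
import time
import zlib
from typing import Callable, Iterable, Iterator, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key (e.g. service name or event type) to a stable partition index.

    Events with the same key always land in the same partition, so their
    relative order is preserved while different keys are ingested in parallel.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def throttled_batches(
    events: Iterable,
    batch_size: int,
    max_events_per_sec: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List]:
    """Yield events in pages of batch_size, pausing after each full page.

    The pause of batch_size / max_events_per_sec seconds keeps the average
    replay rate at or below the cap.
    """
    batch: List = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            sleep(len(batch) / max_events_per_sec)
            batch = []
    if batch:
        yield batch  # final partial page, no pause needed


if __name__ == "__main__":
    print(partition_for("orders-service", 8) == partition_for("orders-service", 8))
    for page in throttled_batches(range(5), batch_size=2, max_events_per_sec=1000):
        print(page)
```

The `sleep` parameter is injectable so the pacing can be tested without real delays. Because the generator only advances when the consumer asks for the next page, a slow consumer naturally slows the replay, which is a basic form of backpressure.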

Interview Tips

Time: Spend 10 minutes clarifying requirements and constraints, 20 minutes designing architecture and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.

Explain importance of immutable event logs for replay

Discuss how ordering and exactly-once semantics are ensured

Highlight how replay APIs provide flexibility for recovery

Describe how system handles high throughput and large replays

Mention monitoring and alerting for operational reliability

Practice

1. What is the main purpose of event replay in a microservices architecture?

easy

A. To balance load between microservices

B. To rebuild system state by reprocessing stored events in order

C. To send real-time notifications to users

D. To encrypt data during transmission

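Solution

Step 1: Understand event replay concept. Event replay reprocesses events that were stored in an immutable log, in their original order.

Step 2: Identify the main purpose. Replay exists to rebuild state or recover from failures; it is not a mechanism for load balancing, user notifications, or encryption.

Final Answer: B. To rebuild system state by reprocessing stored events in order

Quick Check: Replay means reprocessing stored events in order to rebuild state.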

Solution

Step 1: Understand importance of event order

Step 2: Identify correct ordering method

Final Answer:

Quick Check:

Solution

Step 1: Sort events by timestamp

Step 2: Extract event names in sorted order

Final Answer:

Quick Check:

Solution

Step 1: Analyze replay error cause

Step 2: Identify the most common cause

Final Answer:

Quick Check:

Solution

Step 1: Understand impact of replay on live system

Step 2: Choose design for performance and safety

Final Answer:

Quick Check: