Event replay in Microservices - System Design Guide

Problem Statement

When a microservice crashes or a new service instance starts, it may miss important past events needed to build its current state. Without a way to recover these events, the service can produce incorrect results or inconsistent data.

Solution

Event replay solves this by storing all events in an immutable log. When a service needs to recover or catch up, it replays these stored events in order to rebuild its state exactly as it was before the failure or startup.

Architecture

┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Event Producer│──────▶│ Event Store   │──────▶│ Service       │
└───────────────┘       └───────────────┘       └───────────────┘
                                ▲                      │
                                │                      │
                                └──────────────────────┘
                                Event Replay Flow

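The diagram shows a producer writing events to an event store, which delivers them to the service. When the service needs to rebuild its state, it reads the stored events back from the event store and applies them again (the event replay flow).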
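The flow in the diagram can be sketched in Python roughly as follows. This is a minimal in-memory illustration, not a production event store: `InMemoryEventStore`, `read_from`, and `CounterService` are hypothetical names, and the sketch assumes the store assigns each event a sequential offset so a service can replay the whole log or only the part it missed.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryEventStore:
    """Hypothetical append-only log; the list index serves as the event offset."""
    _events: list = field(default_factory=list)

    def append(self, event: dict) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def read_from(self, offset: int = 0) -> list:
        # Return (offset, event) pairs in append order, starting at `offset`.
        return list(enumerate(self._events))[offset:]


class CounterService:
    """Hypothetical consumer that counts events per type."""

    def __init__(self, store: InMemoryEventStore):
        self.store = store
        self.counts = {}
        self.next_offset = 0  # first offset this instance has not applied yet

    def catch_up(self) -> None:
        # Replay only the events this instance has not applied, in log order.
        for offset, event in self.store.read_from(self.next_offset):
            self.counts[event["type"]] = self.counts.get(event["type"], 0) + 1
            self.next_offset = offset + 1


store = InMemoryEventStore()
store.append({"type": "OrderCreated"})
store.append({"type": "OrderCreated"})

service = CounterService(store)
service.catch_up()  # a new instance replays the full history
store.append({"type": "OrderCancelled"})
service.catch_up()  # later it replays only the event it missed
```

Tracking `next_offset` is one possible design choice: it lets the same method serve both a full rebuild (offset 0) and an incremental catch-up.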

Trade-offs

✓ Pros

→ Ensures services can recover state accurately after crashes or restarts.

→ Enables new services to bootstrap state by replaying historical events.

→ Provides a complete audit trail of all changes for debugging and compliance.

✗ Cons

→ Replaying large event logs can be slow and resource-intensive.

→ Requires careful versioning of events to handle schema changes over time.

→ Event stores grow indefinitely unless pruning or snapshots are implemented.
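The versioning concern listed above is often handled by upgrading old events to the current shape as they are read during replay (sometimes called upcasting). The following is a minimal sketch, assuming a hypothetical `OrderCreated` event whose version 1 stored only `amount` and whose version 2 adds `currency`. The field names, the `version` key, and the default currency are illustrative assumptions, not part of any specific event store.

```python
CURRENT_VERSION = 2


def upcast(event: dict) -> dict:
    """Return a copy of `event` converted to the current schema version."""
    upgraded = dict(event)
    version = upgraded.get("version", 1)  # assumption: unversioned events are v1
    if version == 1:
        # Assumption for illustration: all v1 amounts were in USD.
        upgraded["currency"] = "USD"
        version = 2
    if version != CURRENT_VERSION:
        raise ValueError(f"unsupported event version: {version}")
    upgraded["version"] = version
    return upgraded


def replay_totals(events: list) -> dict:
    """Rebuild the total order value per currency from a mixed-version log."""
    totals = {}
    for raw in events:
        event = upcast(raw)
        totals[event["currency"]] = totals.get(event["currency"], 0) + event["amount"]
    return totals


log = [
    {"type": "OrderCreated", "amount": 10},                                  # v1
    {"type": "OrderCreated", "version": 2, "amount": 5, "currency": "EUR"},  # v2
]
totals = replay_totals(log)
```

Because `upcast` works on a copy, the stored events stay untouched, which keeps the log immutable while the replay logic only ever sees the newest schema.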

Use event replay when services maintain complex state that must be rebuilt reliably, especially in systems with frequent restarts or scaling events. As a rough guide, it tends to pay off at volumes of thousands of events per second or more.

Avoid event replay if your service state is simple and can be restored quickly from a database snapshot, or if event volume is very low (under a few hundred events per day), where the overhead of maintaining and replaying a log outweighs the benefits.

Real World Examples

Uber

Uber uses event replay to rebuild trip and driver state after service failures, ensuring no data loss in their real-time dispatch system.

LinkedIn

LinkedIn replays user activity events to reconstruct timelines and notifications after outages or service upgrades.

Netflix

Netflix replays streaming session events to recover user playback state and preferences after client or server restarts.

Code Example

The before code stores only the current state and loses data on restart. The after code stores all events in an event store and replays them on recovery to rebuild the state exactly.

### Before: Service stores only current state, no event replay
class OrderService:
    def __init__(self):
        self.orders = {}

    def create_order(self, order_id, details):
        self.orders[order_id] = details

    def recover(self):
        # No way to recover lost state
        self.orders = {}


### After: Service stores events and replays them to recover
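class EventStore:
    """In-memory, append-only event log (a real store would persist events)."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def get_all(self):
        # Return a copy so callers cannot mutate the log.
        return list(self.events)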


class OrderServiceWithReplay:
    def __init__(self, event_store):
        self.orders = {}
        self.event_store = event_store

    def create_order(self, order_id, details):
        event = {'type': 'OrderCreated', 'order_id': order_id, 'details': details}
        self.event_store.append(event)
        self.apply(event)

    def apply(self, event):
        if event['type'] == 'OrderCreated':
            self.orders[event['order_id']] = event['details']

    def recover(self):
        # Discard in-memory state and rebuild it by replaying every event in order.
        self.orders = {}
        for event in self.event_store.get_all():
            self.apply(event)

Alternatives

Snapshotting

Stores periodic full-state snapshots so that recovery loads the latest snapshot and replays only the events recorded after it, instead of replaying all events from the start.

Use when: event logs are very large and replaying all events is too slow.
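A minimal sketch of how a snapshot can shorten replay, assuming events sit in an in-memory list and a snapshot is a copy of the state plus the offset of the next event to apply. `Snapshot`, `apply`, `recover`, and the `OrderCancelled` event type are hypothetical names used only for illustration.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    state: dict
    next_offset: int  # index of the first event NOT included in `state`


def apply(state: dict, event: dict) -> None:
    """Apply one event to the state in place."""
    if event["type"] == "OrderCreated":
        state[event["order_id"]] = event["details"]
    elif event["type"] == "OrderCancelled":
        state.pop(event["order_id"], None)


def recover(events: list, snapshot: Optional[Snapshot] = None):
    """Return (state, number of events replayed)."""
    # Start from the snapshot if there is one, otherwise from empty state.
    state = copy.deepcopy(snapshot.state) if snapshot else {}
    start = snapshot.next_offset if snapshot else 0
    for event in events[start:]:
        apply(state, event)
    return state, len(events) - start


events = [
    {"type": "OrderCreated", "order_id": 1, "details": "book"},
    {"type": "OrderCreated", "order_id": 2, "details": "pen"},
    {"type": "OrderCancelled", "order_id": 1},
    {"type": "OrderCreated", "order_id": 3, "details": "lamp"},
]

# Take a snapshot after the first three events have been applied.
state_at_3, _ = recover(events[:3])
snapshot = Snapshot(copy.deepcopy(state_at_3), next_offset=3)

full_state, full_count = recover(events)            # replays all four events
fast_state, fast_count = recover(events, snapshot)  # replays only the last one
```

Both paths should arrive at the same state; the snapshot only changes how many events have to be replayed to get there.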

Stateful Database Replication

Replicates current state directly between databases without replaying events.

Use when: event sourcing is not used and state changes are stored as direct database updates.

Summary

Event replay stores all changes as events and replays them to rebuild service state after failures.

It ensures accurate recovery and auditability but can be slow for large event logs without snapshots.

Use event replay when state recovery is critical and event volume justifies the complexity.

Practice

1. What is the main purpose of event replay in a microservices architecture?

easy

A. To balance load between microservices

B. To rebuild system state by reprocessing stored events in order

C. To send real-time notifications to users

D. To encrypt data during transmission
