HLDsystem_design~25 mins

Database replication (master-slave) in HLD - System Design Exercise

Choose your learning style9 modes available

Learn Why Deep Arch Practice Challenge Design Recall Scale

Design: Database Replication (Master-Slave)

Design focuses on the replication architecture, data flow, and failover handling. Out of scope are specific database engine internals and network infrastructure details.

Functional Requirements

FR1: Maintain one master database that handles all write operations.

FR2: Replicate data from the master to one or more slave databases for read operations.

FR3: Ensure slaves are eventually consistent with the master.

FR4: Support read scaling by distributing read queries to slaves.

FR5: Handle failover scenarios where a slave can be promoted to master if needed.

FR6: Provide monitoring for replication lag and health.

Non-Functional Requirements

NFR1: Replication lag should be minimal, ideally under 1 second.

NFR2: System should support up to 10,000 read queries per second.

NFR3: Write latency on master should remain under 100ms.

NFR4: Availability target: 99.9% uptime (about 8.77 hours downtime per year).

NFR5: Data consistency model: eventual consistency between master and slaves.

Think Before You Design

Questions to Ask

❓ Question 1

❓ Question 2

❓ Question 3

❓ Question 4

❓ Question 5

❓ Question 6

Key Components

Master database server

Slave database servers

Replication mechanism (e.g., binary log shipping, streaming replication)

Load balancer or proxy for read queries

Monitoring and alerting system

Failover management component

Design Patterns

Master-slave replication pattern

Read-write splitting

Failover and leader election

Eventual consistency

Monitoring and alerting

Reference Architecture

          +-------------------+
          |    Application    |
          +---------+---------+
                    |
          +---------v---------+
          |  Load Balancer /   |
          |  Read Proxy       |
          +----+----------+---+
               |          |
       +-------v--+   +---v-------+
       |  Slave 1  |   |  Slave 2  |
       +----------+   +-----------+
               ^          ^
               |          |
          +----+----------+---+
          |     Master DB      |
          +--------------------+

Components

Master Database

Relational DBMS with replication support (e.g., PostgreSQL, MySQL)

Handles all write operations and generates replication logs.

Slave Databases

Same DBMS as master

Receive and apply replication logs from master; serve read queries.

Replication Mechanism

Streaming replication or binary log shipping

Transfers data changes from master to slaves asynchronously.

Load Balancer / Read Proxy

Proxy server or middleware (e.g., HAProxy, ProxySQL)

Distributes read queries among slaves to scale reads.

Monitoring System

Monitoring tools (e.g., Prometheus, Grafana)

Tracks replication lag, server health, and alerts on issues.

Failover Manager

Automated failover tool (e.g., Patroni, Orchestrator)

Detects master failure and promotes a slave to master.

Request Flow

1. 1. Application sends write queries to the master database.

2. 2. Master executes writes and records changes in a replication log.

3. 3. Replication mechanism streams or ships logs to slave databases asynchronously.

4. 4. Slaves apply changes from the replication log to update their data.

5. 5. Application sends read queries to the load balancer or read proxy.

6. 6. Load balancer routes read queries to one of the slave databases.

7. 7. Monitoring system continuously checks replication lag and server health.

8. 8. If master fails, failover manager promotes a slave to master and updates routing.

Database Schema

Entities remain the same as the application database schema. Replication is at the database engine level, so no schema changes are required. Relationships and tables are replicated as-is from master to slaves.

Scaling Discussion

Bottlenecks

Master database write throughput limits.

Replication lag increases with network or load.

Load balancer becomes a single point of failure.

Failover delay impacts availability.

Monitoring system overload with many replicas.

Solutions

Scale master vertically or use sharding for write scaling.

Optimize replication by compressing logs and using faster network links.

Use multiple load balancers with health checks and failover.

Automate failover with fast detection and promotion tools.

Distribute monitoring load and aggregate metrics efficiently.

Interview Tips

Time: Spend 10 minutes understanding requirements and clarifying constraints, 20 minutes designing the architecture and data flow, 10 minutes discussing scaling and failover, and 5 minutes summarizing.

Explain the master-slave roles clearly and why writes go to master only.

Discuss eventual consistency and replication lag trade-offs.

Describe how read scaling is achieved by distributing reads to slaves.

Highlight failover strategy and monitoring importance.

Mention bottlenecks and realistic scaling solutions.