0
0
HLDsystem_design~25 mins

Database replication (master-slave) in HLD - System Design Exercise

Choose your learning style9 modes available
Design: Database Replication (Master-Slave)
Design focuses on the replication architecture, data flow, and failover handling. Out of scope are specific database engine internals and network infrastructure details.
Functional Requirements
FR1: Maintain one master database that handles all write operations.
FR2: Replicate data from the master to one or more slave databases for read operations.
FR3: Ensure slaves are eventually consistent with the master.
FR4: Support read scaling by distributing read queries to slaves.
FR5: Handle failover scenarios where a slave can be promoted to master if needed.
FR6: Provide monitoring for replication lag and health.
Non-Functional Requirements
NFR1: Replication lag should be minimal, ideally under 1 second.
NFR2: System should support up to 10,000 read queries per second.
NFR3: Write latency on master should remain under 100ms.
NFR4: Availability target: 99.9% uptime (about 8.77 hours downtime per year).
NFR5: Data consistency model: eventual consistency between master and slaves.
Think Before You Design
Questions to Ask
❓ Question 1
❓ Question 2
❓ Question 3
❓ Question 4
❓ Question 5
❓ Question 6
Key Components
Master database server
Slave database servers
Replication mechanism (e.g., binary log shipping, streaming replication)
Load balancer or proxy for read queries
Monitoring and alerting system
Failover management component
Design Patterns
Master-slave replication pattern
Read-write splitting
Failover and leader election
Eventual consistency
Monitoring and alerting
Reference Architecture
          +-------------------+
          |    Application    |
          +---------+---------+
                    |
          +---------v---------+
          |  Load Balancer /   |
          |  Read Proxy       |
          +----+----------+---+
               |          |
       +-------v--+   +---v-------+
       |  Slave 1  |   |  Slave 2  |
       +----------+   +-----------+
               ^          ^
               |          |
          +----+----------+---+
          |     Master DB      |
          +--------------------+
Components
Master Database
Relational DBMS with replication support (e.g., PostgreSQL, MySQL)
Handles all write operations and generates replication logs.
Slave Databases
Same DBMS as master
Receive and apply replication logs from master; serve read queries.
Replication Mechanism
Streaming replication or binary log shipping
Transfers data changes from master to slaves asynchronously.
Load Balancer / Read Proxy
Proxy server or middleware (e.g., HAProxy, ProxySQL)
Distributes read queries among slaves to scale reads.
Monitoring System
Monitoring tools (e.g., Prometheus, Grafana)
Tracks replication lag, server health, and alerts on issues.
Failover Manager
Automated failover tool (e.g., Patroni, Orchestrator)
Detects master failure and promotes a slave to master.
Request Flow
1. 1. Application sends write queries to the master database.
2. 2. Master executes writes and records changes in a replication log.
3. 3. Replication mechanism streams or ships logs to slave databases asynchronously.
4. 4. Slaves apply changes from the replication log to update their data.
5. 5. Application sends read queries to the load balancer or read proxy.
6. 6. Load balancer routes read queries to one of the slave databases.
7. 7. Monitoring system continuously checks replication lag and server health.
8. 8. If master fails, failover manager promotes a slave to master and updates routing.
Database Schema
Entities remain the same as the application database schema. Replication is at the database engine level, so no schema changes are required. Relationships and tables are replicated as-is from master to slaves.
Scaling Discussion
Bottlenecks
Master database write throughput limits.
Replication lag increases with network or load.
Load balancer becomes a single point of failure.
Failover delay impacts availability.
Monitoring system overload with many replicas.
Solutions
Scale master vertically or use sharding for write scaling.
Optimize replication by compressing logs and using faster network links.
Use multiple load balancers with health checks and failover.
Automate failover with fast detection and promotion tools.
Distribute monitoring load and aggregate metrics efficiently.
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying constraints, 20 minutes designing the architecture and data flow, 10 minutes discussing scaling and failover, and 5 minutes summarizing.
Explain the master-slave roles clearly and why writes go to master only.
Discuss eventual consistency and replication lag trade-offs.
Describe how read scaling is achieved by distributing reads to slaves.
Highlight failover strategy and monitoring importance.
Mention bottlenecks and realistic scaling solutions.