| Users / Traffic | Writes per second | Reads per second | Replication Lag | Number of Slaves | Notes |
|---|---|---|---|---|---|
| 100 users | 10 | 50 | Negligible | 1 | Single master, one slave for read scaling |
| 10,000 users | 1,000 | 5,000 | Small, under 1 second | 3-5 | More slaves added to handle read traffic |
| 1,000,000 users | 50,000 | 200,000 | Noticeable, seconds delay | 10-20 | Replication lag increases, master CPU high |
| 100,000,000 users | 1,000,000+ | 5,000,000+ | High, seconds to minutes | 50+ | Master bottleneck, replication lag critical |
Database replication (master-slave) in HLD - Scalability & System Analysis
The master database server is the first bottleneck because it handles all write operations. As traffic grows, the master CPU and disk I/O become overwhelmed. Replication lag to slaves increases, causing stale reads. Network bandwidth between master and slaves can also limit replication speed.
- Read Scaling: Add more slave replicas to distribute read queries.
- Write Scaling: Use sharding to split data across multiple masters.
- Connection Pooling: Reduce overhead by reusing database connections.
- Caching: Use in-memory caches (e.g., Redis) to reduce read load on slaves.
- Asynchronous Replication: Accept some lag to improve master throughput.
- Monitoring: Track replication lag and master load to trigger scaling actions.
Assuming 1 master and 5 slaves at 10,000 users:
- Writes: ~1,000 QPS on master (CPU and disk intensive)
- Reads: ~5,000 QPS distributed across slaves (~1,000 QPS each)
- Replication bandwidth: ~10 MB/s between master and slaves (depends on data size)
- Storage: Master and slaves store full dataset; storage grows with data size
- Network: Master needs high bandwidth to push changes to all slaves
Start by explaining the master-slave model and its purpose (write master, read slaves). Discuss bottlenecks focusing on the master for writes and replication lag. Then propose scaling strategies like adding slaves, sharding, and caching. Always mention trade-offs such as eventual consistency and lag. Use numbers to justify your points.
Your database master handles 1,000 QPS writes. Traffic grows 10x to 10,000 QPS writes. What do you do first and why?
Answer: The master is overwhelmed. First, consider sharding the database to split write load across multiple masters. This reduces load per master and improves write scalability. Adding more slaves won't help write bottleneck.