| Users | System Behavior | Failure Impact | Recovery Time |
|---|---|---|---|
| 100 users | System mostly stable | Minor disruption if SPOF fails | Minutes to recover manually |
| 10,000 users | Increased load on SPOF component | Partial or full outage affecting many users | Longer downtime, manual intervention needed |
| 1,000,000 users | High dependency on SPOF | Major outage, large user base affected | Significant downtime, costly recovery |
| 100,000,000 users | Critical SPOF causes system-wide failure | Complete service outage, severe business impact | Extended downtime, emergency fixes required |
Single point of failure identification in HLD - Scalability & System Analysis
Here the first bottleneck is the single point of failure (SPOF): the component or resource that, if it fails, stops the entire system from working.
Examples include a single database server without replicas, one load balancer without failover, or a single network switch.
At small scale such a failure may cause only a minor disruption, but as the user base grows the same failure cuts off far more people and severely hurts availability.
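One simple way to spot candidates is to count how many independent instances exist for each role. The sketch below assumes a hand-written inventory; the role names and counts are hypothetical and not taken from any real system.

```python
# Hypothetical inventory: component role -> number of independent instances.
inventory = {
    "load_balancer": 1,
    "app_server": 3,
    "database": 1,
    "network_switch": 2,
}


def find_spofs(inventory):
    """Return the roles that have exactly one instance, sorted by name."""
    return sorted(role for role, count in inventory.items() if count == 1)


spofs = find_spofs(inventory)
print(spofs)  # ['database', 'load_balancer']
```

A count of one is only a first filter: several instances that share one power supply or one network link can still fail together.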
- Redundancy: Add duplicate components (e.g., multiple servers, databases) so if one fails, others take over.
- Load Balancing: Distribute traffic across multiple instances to avoid overloading one component.
- Failover Mechanisms: Switch automatically to a backup system when the primary fails.
- Data Replication: Keep copies of data in multiple places to avoid data loss and downtime.
- Health Checks and Monitoring: Detect failures early and trigger recovery actions.
- Decoupling Components: Use message queues or event-driven designs to reduce tight dependencies.
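The redundancy, health check, and failover ideas above can be combined in a few lines. This is a minimal in-memory sketch, not a production router: the `Backend` class, its `healthy` flag, and the `route` function are hypothetical names used only for illustration, and the health check is simulated rather than a real network probe.

```python
class Backend:
    """Hypothetical stand-in for a server; `healthy` simulates a health check result."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def health_check(self):
        return self.healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def route(request, backends):
    """Send the request to the first backend that passes its health check."""
    for backend in backends:
        if backend.health_check():
            return backend.handle(request)
    raise RuntimeError("no healthy backend available")


primary = Backend("primary")
standby = Backend("standby")
pool = [primary, standby]

first = route("GET /orders", pool)
primary.healthy = False  # simulate the primary failing
second = route("GET /orders", pool)

print(first)   # primary served GET /orders
print(second)  # standby served GET /orders
```

The router itself must also be redundant, otherwise the single point of failure has only moved from the server to the component in front of it.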
Assuming a system with 10,000 users generating 100 requests per second (RPS):
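- If a single server handles ~5,000 concurrent connections, at least 2 servers are needed to carry 10,000 simultaneous users; add one more (N+1) if the system must stay at full capacity when a server fails.
- A database with 5,000 QPS capacity has ample headroom at 100 RPS, so read replicas matter here mainly for failover; they also scale reads as traffic grows.
- Network bandwidth: 1 Gbps (~125 MB/s) is sufficient for typical traffic; multiple network paths are recommended.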
- Adding redundancy roughly doubles infrastructure cost but greatly improves availability.
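The sizing and availability arithmetic behind these estimates can be sketched as follows. Two assumptions are mine rather than the source's: all 10,000 users are connected at once, and each instance is available 99% of the time with failures independent of one another. The function names are hypothetical.

```python
import math


def servers_needed(peak_connections, per_server_connections, spares=1):
    """N+1 sizing: enough servers for the load, plus spare(s) for redundancy."""
    return math.ceil(peak_connections / per_server_connections) + spares


def combined_availability(single, replicas):
    """Availability of redundant copies, assuming their failures are independent."""
    return 1 - (1 - single) ** replicas


for_load_only = servers_needed(10_000, 5_000, spares=0)
with_spare = servers_needed(10_000, 5_000)
print(for_load_only, with_spare)  # 2 3

# Assumed 99% availability per instance; two independent instances in parallel.
pair = combined_availability(0.99, 2)
print(round(pair, 6))  # about 0.9999
```

Under these assumptions a second instance cuts expected downtime by a factor of about 100, which is why doubling cost can still be a good trade. Correlated failures, such as a shared rack or a bad deploy, reduce that gain.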
1. Identify critical components in the system.
2. Explain how failure of each affects the system.
3. Prioritize components by impact and likelihood of failure.
4. Suggest practical redundancy and failover solutions.
5. Discuss trade-offs between cost, complexity, and availability.
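Steps 1 and 2 can be partly mechanized by modeling the request path as a graph and removing one component at a time. The sketch below assumes a hypothetical topology in which `user` and `data` are virtual endpoints rather than real components; any node whose removal cuts every path between them is reported as a single point of failure.

```python
from collections import deque

# Hypothetical request path: each component lists the components it forwards to.
topology = {
    "user": ["lb"],
    "lb": ["app1", "app2"],
    "app1": ["db"],
    "app2": ["db"],
    "db": ["data"],
    "data": [],
}


def reachable(graph, source, target, removed=None):
    """Breadth-first search from source to target, skipping the removed node."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, source, target):
    """Components whose removal alone cuts every path from source to target."""
    candidates = [n for n in graph if n not in (source, target)]
    return [n for n in candidates if not reachable(graph, source, target, removed=n)]


result = single_points_of_failure(topology, "user", "data")
print(result)  # ['lb', 'db']
```

The two app servers are not flagged because each covers for the other, while the lone load balancer and database are. Ranking the flagged components by impact and likelihood (step 3) still needs human judgment.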
Your database handles 1,000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?
Answer: Add read replicas and implement connection pooling to distribute load and avoid the database becoming a single point of failure.
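A rough sizing for that answer, as a sketch only: it assumes 90% of queries are reads and that each replica has the same 1,000 QPS capacity as the primary. Neither figure comes from the question, and the function name is hypothetical.

```python
def read_replicas_needed(total_qps, read_percent, per_node_qps):
    """Replicas needed to absorb the read share, using integer ceiling division."""
    read_qps = total_qps * read_percent // 100
    write_qps = total_qps - read_qps
    replicas = -(-read_qps // per_node_qps)  # ceil(read_qps / per_node_qps)
    return replicas, write_qps


# Assumed: 90% reads, 1,000 QPS per node.
replicas, write_qps = read_replicas_needed(10_000, 90, 1_000)
print(replicas, write_qps)  # 9 1000
```

Under these assumptions the remaining 1,000 QPS of writes already sits at the primary's limit. Replicas relieve read load and give a failover target, but a write-heavy workload would need a different fix.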