Redundancy and fault tolerance in HLD - System Design Guide

Problem Statement
When a critical component in a system fails, the entire service can become unavailable, causing downtime and loss of user trust. Without backup components or mechanisms, a single failure can cascade and bring down the whole system.
Solution
Redundancy duplicates critical components so that if one fails, another can take over. Fault tolerance is the system's ability to keep operating correctly despite failures: it detects faults and switches to backups with little or no interruption to service.
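The detect-and-switch behavior described above can be sketched in a few lines. This is a minimal illustration, not a production pattern: `call_with_failover`, `primary`, `backup`, and `AllReplicasFailed` are hypothetical names, and the sketch assumes a fault shows up as a raised exception.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant replica fails to handle the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:
            # Fault detected: remember it and move on to the next replica.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed for {request!r}")


def primary(request: str) -> str:
    # Hypothetical primary component that is currently down.
    raise ConnectionError("primary unreachable")


def backup(request: str) -> str:
    # Hypothetical backup component that is healthy.
    return f"backup handled {request}"


result = call_with_failover([primary, backup], "GET /home")
print(result)
```

The caller never sees the primary's failure; it only sees an error if every replica fails, which is the point at which redundancy has been exhausted.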
Architecture
[Diagram: Client App -> Load Balancer (LB) -> Server A (Primary) and Server B (Backup)]

This diagram shows a client sending requests to a load balancer, which routes traffic to two servers. Server A is the primary; Server B is its backup and takes over if Server A fails, providing redundancy and fault tolerance.
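A rough sketch of the routing in this architecture follows, assuming an active-passive setup in which the load balancer prefers Server A and uses Server B only when Server A is marked unhealthy. The `Server` and `LoadBalancer` classes and the `healthy` flag are illustrative assumptions; a real load balancer would refresh health status through periodic health checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str
    healthy: bool = True  # assumed to be updated by an external health check

    def handle(self, request: str) -> str:
        return f"{self.name} served {request}"


class LoadBalancer:
    """Active-passive routing: use the primary while healthy, else the backup."""

    def __init__(self, primary: Server, backup: Server) -> None:
        self.primary = primary
        self.backup = backup

    def pick(self) -> Optional[Server]:
        # Prefer the primary; fall back to the backup if the primary is down.
        for server in (self.primary, self.backup):
            if server.healthy:
                return server
        return None

    def route(self, request: str) -> str:
        server = self.pick()
        if server is None:
            return "503 Service Unavailable"
        return server.handle(request)


server_a = Server("Server A")
server_b = Server("Server B")
lb = LoadBalancer(server_a, server_b)

before = lb.route("GET /")
server_a.healthy = False  # simulate a failed health check on the primary
after = lb.route("GET /")
print(before)
print(after)
```

Note that the load balancer itself remains a single point of failure in this sketch; in practice it would need its own redundancy.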

Trade-offs
✓ Pros
Improves system availability by eliminating single points of failure.
Enables seamless failover to backup components without user impact.
Supports maintenance and upgrades without downtime by switching traffic.
Increases system reliability and user trust.
✗ Cons
Adds hardware and operational costs due to duplicate components.
Increases system complexity, since monitoring and failover logic are required.
May introduce slight latency due to health checks and failover mechanisms.
Use when uptime is critical and expected traffic exceeds 1,000 requests per second, or when downtime costs are high.
Avoid when the system is small-scale with low traffic (under 100 requests per second), or when cost constraints outweigh availability needs.
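To weigh the availability gain against the added cost, a back-of-the-envelope calculation can help. The sketch below assumes replicas fail independently and that any single healthy replica is enough to serve traffic; the 99% per-component figure is purely illustrative, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, replicas: int) -> float:
    """Availability of N independent replicas when any one of them suffices."""
    # The system is down only if every replica is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime implied by an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    print(f"{n} replica(s): availability={a:.6f}, "
          f"downtime={downtime_hours_per_year(a):.2f} h/year")
```

Real failures are often correlated (shared power, network, or deployment bugs), so the independence assumption makes these numbers optimistic.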
Real World Examples
Netflix
Uses redundant edge servers and fault-tolerant load balancers to ensure uninterrupted streaming even if some servers fail.
Amazon
Employs multiple availability zones with redundant databases and services to maintain fault tolerance during data center failures.
Google
Implements redundancy in its global network and data centers to provide fault tolerance and high availability for search and cloud services.
Alternatives
Graceful degradation
Instead of full redundancy, the system reduces functionality under failure to maintain partial service.
Use when: full redundancy is too costly but some service continuity is required.
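As a minimal sketch of graceful degradation, the example below serves a static fallback when a non-critical dependency fails, instead of failing the whole page. The recommender, `DEFAULT_ITEMS`, and `get_home_page` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical static fallback shown when personalization is unavailable.
DEFAULT_ITEMS = ["top-1", "top-2", "top-3"]


def get_home_page(user_id: str, recommend: Callable[[str], List[str]]) -> Dict[str, object]:
    """Build a home page, degrading to default items if recommendations fail."""
    try:
        items = recommend(user_id)
        degraded = False
    except Exception:
        # Reduced functionality: generic items instead of personalized ones.
        items = list(DEFAULT_ITEMS)
        degraded = True
    return {"user": user_id, "items": items, "degraded": degraded}


def broken_recommender(user_id: str) -> List[str]:
    # Simulates a failing recommendation service.
    raise TimeoutError("recommendation service timed out")


page = get_home_page("u42", broken_recommender)
print(page)
```

The `degraded` flag lets the caller log or surface the reduced mode rather than hiding it silently.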
Circuit breaker
Prevents cascading failures by stopping calls to failing components rather than duplicating them.
Use when: failures are transient and you want to isolate faults quickly.
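A simplified circuit breaker might look like the sketch below. The thresholds, the `clock` parameter, and the class and exception names are assumptions made for illustration; production libraries typically add richer half-open handling, metrics, and thread safety.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[..., Any], *args: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the failing component.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky_service() -> str:
    # Hypothetical downstream dependency that is currently failing.
    raise ConnectionError("downstream unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)
for attempt in range(3):
    try:
        breaker.call(flaky_service)
    except CircuitOpenError as exc:
        print(f"attempt {attempt}: {exc}")
    except ConnectionError as exc:
        print(f"attempt {attempt}: {exc}")
```

Once the threshold is reached, further calls fail fast until the reset timeout passes, which gives the failing component time to recover and keeps callers from piling up on it.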
Summary
Redundancy duplicates critical components to avoid single points of failure.
Fault tolerance enables systems to continue operating despite component failures.
Together, they improve availability but add cost and complexity.