
Health checks in HLD - System Design Guide


Problem Statement

When a service or server fails silently or becomes unresponsive, the system continues to send traffic to it, causing slow responses or complete outages. Without a way to detect unhealthy components, the system cannot reroute traffic or trigger recovery actions, leading to poor user experience and downtime.

Solution

Health checks periodically test if a service or server is working correctly by sending simple requests or probes. If a component fails the health check, it is marked unhealthy and removed from the pool of active servers until it recovers. This ensures traffic only goes to healthy components, improving reliability and availability.
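The mark-unhealthy and recover cycle described above can be sketched as a small state tracker. This is a minimal illustration, not a production checker: `HealthChecker`, `TargetState`, the `probe` callable, and both threshold parameters are hypothetical names chosen for this example. It assumes a target changes state only after several consecutive probe results agree, which is one common way to keep a single dropped probe from removing a server.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TargetState:
    healthy: bool = True            # assumption: targets start out healthy
    consecutive_failures: int = 0
    consecutive_successes: int = 0


class HealthChecker:
    """Tracks target health from probe results (hypothetical helper)."""

    def __init__(self, probe: Callable[[str], bool],
                 unhealthy_threshold: int = 3, healthy_threshold: int = 2):
        self.probe = probe                          # returns True if the target responded correctly
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.targets: Dict[str, TargetState] = {}

    def add_target(self, name: str) -> None:
        self.targets[name] = TargetState()

    def run_once(self) -> None:
        """Probe every target once; a scheduler would call this every interval."""
        for name, state in self.targets.items():
            try:
                ok = self.probe(name)
            except Exception:
                ok = False  # a probe that raises (timeout, refused connection) counts as a failure
            if ok:
                state.consecutive_successes += 1
                state.consecutive_failures = 0
                if (not state.healthy
                        and state.consecutive_successes >= self.healthy_threshold):
                    state.healthy = True    # recovered: return to the active pool
            else:
                state.consecutive_failures += 1
                state.consecutive_successes = 0
                if (state.healthy
                        and state.consecutive_failures >= self.unhealthy_threshold):
                    state.healthy = False   # failed: remove from the active pool

    def healthy_targets(self) -> List[str]:
        return [name for name, state in self.targets.items() if state.healthy]
```

In a real deployment the `probe` callable would typically issue an HTTP or TCP request with a timeout, and `run_once` would be driven by a timer.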

Architecture

┌─────────────┐       ┌─────────────┐       ┌─────────────┐
│Load Balancer│──────▶│ Health Check│──────▶│   Service   │
└─────────────┘       └─────────────┘       └─────────────┘
       │                    ▲                     │
       │                    │                     │
       │                    │                     │
       └────────────────────┴─────────────────────┘

This diagram shows a load balancer sending traffic to services only after health checks confirm they are healthy. The health check component probes the service and reports status back to the load balancer.
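The routing side of this architecture might look like the following sketch, assuming a simple round-robin policy. The `LoadBalancer` class and its `report` and `pick` methods are hypothetical names for illustration; real load balancers expose this through configuration rather than code like this.

```python
from itertools import count
from typing import Dict, List


class LoadBalancer:
    """Round-robin balancer that skips backends reported unhealthy (illustrative only)."""

    def __init__(self, backends: List[str]):
        self.backends = list(backends)
        self.health: Dict[str, bool] = {b: True for b in self.backends}
        self._counter = count()

    def report(self, backend: str, healthy: bool) -> None:
        """Called by the health check component with the latest status."""
        self.health[backend] = healthy

    def pick(self) -> str:
        """Return the next backend, considering only those currently marked healthy."""
        pool = [b for b in self.backends if self.health[b]]
        if not pool:
            raise RuntimeError("no healthy backends available")
        return pool[next(self._counter) % len(pool)]


lb = LoadBalancer(["s1", "s2", "s3"])
lb.report("s2", False)                    # health check marks s2 unhealthy
picks = [lb.pick() for _ in range(4)]     # only s1 and s3 are returned
```

Raising an error when the pool is empty is one design choice among several; some systems instead fall back to sending traffic to all backends when every one of them fails its checks.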

Trade-offs

✓ Pros

→ Improves system reliability by routing traffic only to healthy services.
→ Enables automatic detection and removal of failed components.
→ Supports proactive recovery actions like auto-scaling or alerts.
→ Simple to implement and integrates with load balancers and orchestrators.

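✗ Cons

→ Adds extra network traffic and processing overhead, especially when checks run frequently.
→ A poorly designed check can misreport health: it may mark a working service as unhealthy (false positive) or keep passing while the service is actually broken (false negative).
→ Requires careful tuning of check frequency, timeout, and failure threshold values.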
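The tuning trade-off can be made concrete with some rough arithmetic. The sketch below uses a simplified model with assumed parameter names (`interval_s`, `timeout_s`, `unhealthy_threshold`, `checkers`): probes start on a fixed schedule, the timeout is no longer than the interval, and a target is removed only after a number of consecutive failed probes.

```python
def worst_case_detection_seconds(interval_s: float, timeout_s: float,
                                 unhealthy_threshold: int) -> float:
    """Upper bound on time from failure to removal under the simplified model.

    Worst case: the target fails just after a probe starts, so the next
    `unhealthy_threshold` probes must each fail, and the last one only
    fails when its timeout expires.
    """
    return interval_s * unhealthy_threshold + timeout_s


def probes_per_second(targets: int, checkers: int, interval_s: float) -> float:
    """Total probe rate when every checker probes every target once per interval."""
    return targets * checkers / interval_s


# Example values (assumptions, not recommendations):
detection = worst_case_detection_seconds(interval_s=10, timeout_s=2, unhealthy_threshold=3)
load = probes_per_second(targets=200, checkers=3, interval_s=10)
print(detection)  # 32
print(load)       # 60.0
```

Shortening the interval lowers detection time but raises probe load proportionally, while lowering the threshold speeds up detection at the cost of more false positives from transient failures.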

Use health checks in any distributed system with multiple services or servers, especially when uptime and availability are critical and traffic needs to be routed dynamically.

Avoid health checks in very simple, single-server systems where failure detection is trivial or unnecessary, or when the overhead of checks outweighs benefits at very low scale.

Real World Examples

Netflix

Netflix uses health checks to monitor microservices and edge servers, ensuring traffic is only routed to healthy instances to maintain uninterrupted streaming.

Amazon

Amazon's load balancers (AWS Elastic Load Balancing) run health checks against registered targets such as EC2 instances and automatically reroute traffic away from unhealthy ones to healthy ones, improving availability.

Google

Google Cloud Platform uses health checks to monitor VM instances and containers, enabling automatic failover and scaling decisions.

Alternatives

Heartbeat Monitoring

Instead of active probes, components send periodic 'heartbeat' signals to indicate health.

Use when: Components can push their own status updates and you want to reduce probe traffic.
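A heartbeat monitor can be sketched as a table of last-seen timestamps. This is a minimal illustration assuming a component is considered dead once no heartbeat has arrived within `timeout_s`; the `HeartbeatMonitor` class and its injectable `clock` parameter are hypothetical and exist only to make the example easy to follow and test.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Marks components alive while their heartbeats keep arriving (illustrative only)."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock                       # injectable so tests can control time
        self.last_seen: Dict[str, float] = {}

    def beat(self, component: str) -> None:
        """Called whenever a heartbeat message from `component` is received."""
        self.last_seen[component] = self.clock()

    def alive(self) -> List[str]:
        """Components whose most recent heartbeat is within the timeout."""
        now = self.clock()
        return [c for c, seen in self.last_seen.items()
                if now - seen <= self.timeout_s]
```

The timeout is usually set to a few multiples of the heartbeat period so that one delayed or lost message does not mark a component as dead.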

Passive Health Checks

Health is inferred from actual traffic success/failure rather than active probes.

Use when: You want to avoid extra probe traffic and can rely on the outcomes of real requests.
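One way to infer health from real traffic is to track the error rate over a sliding window of recent requests. The sketch below assumes that approach; `PassiveHealth` and its `window`, `max_error_rate`, and `min_samples` parameters are hypothetical names, and the default values are placeholders rather than recommendations.

```python
from collections import deque


class PassiveHealth:
    """Infers backend health from recent request outcomes (illustrative only)."""

    def __init__(self, window: int = 10, max_error_rate: float = 0.5,
                 min_samples: int = 5):
        self.outcomes = deque(maxlen=window)   # oldest outcome drops off automatically
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, success: bool) -> None:
        """Record the outcome of one real request sent to this backend."""
        self.outcomes.append(success)

    def is_healthy(self) -> bool:
        # Too little data to judge: assume healthy rather than eject on one error.
        if len(self.outcomes) < self.min_samples:
            return True
        errors = sum(1 for ok in self.outcomes if not ok)
        return errors / len(self.outcomes) <= self.max_error_rate
```

A backend that has been ejected receives no traffic and therefore produces no new outcomes, so passive checks are often paired with a timed re-admission or an occasional active probe to detect recovery.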

Summary

Health checks detect unhealthy services to prevent routing traffic to failed components.

They improve system availability by enabling automatic failover and recovery.

Proper design and tuning of health checks are essential to avoid false positives and overhead.