Health checks in HLD - Scalability & System Analysis

Scalability Analysis - Health checks

Growth Table: Health Checks at Different Scales

Users	Health Check Frequency	Number of Services	Monitoring Load	Alerting Complexity
100 users	Every 30 seconds	1-5 services	Low	Simple
10,000 users	Every 15 seconds	10-50 services	Moderate	Moderate
1,000,000 users	Every 5 seconds	100-500 services	High	Complex
100,000,000 users	Every 1-2 seconds	1,000+ services	Very High	Very Complex

First Bottleneck

The first bottleneck is the monitoring system's capacity to issue health check requests and process the responses. As the number of services and the check frequency increase, the monitoring server's CPU, memory, and network bandwidth can be overwhelmed. Health status updates then arrive late or are missed entirely, so failures are detected more slowly and overall reliability drops.

Scaling Solutions

Horizontal Scaling: Deploy multiple monitoring servers to distribute health check load.
Health Check Aggregation: Use local agents or sidecars to aggregate health data before sending to central monitoring.
Adaptive Frequency: Reduce check frequency for stable services, increase for critical or unstable ones.
Caching and Throttling: Cache recent health results and throttle redundant checks to reduce load.
Asynchronous Checks: Use event-driven or push-based health reporting instead of polling.
Sharding Monitoring Data: Partition monitoring data by service group or region so that no single node carries the full load.
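As an illustration of the Adaptive Frequency idea above, here is a minimal Python sketch. The class name CheckSchedule, the doubling back-off rule, and the 1-second and 30-second bounds are all assumptions chosen for the example, not values prescribed by any particular monitoring tool.

```python
from dataclasses import dataclass

MIN_INTERVAL_S = 1.0   # assumed floor: fastest polling rate for an unstable service
MAX_INTERVAL_S = 30.0  # assumed ceiling: slowest polling rate for a stable service


@dataclass
class CheckSchedule:
    """Tracks the polling interval for one service (hypothetical helper)."""
    interval_s: float = 5.0

    def record(self, healthy: bool) -> float:
        """Update the interval from the latest check result and return it."""
        if healthy:
            # Back off gradually while the service keeps passing checks.
            self.interval_s = min(self.interval_s * 2, MAX_INTERVAL_S)
        else:
            # Any failure snaps the service back to the fastest rate.
            self.interval_s = MIN_INTERVAL_S
        return self.interval_s


schedule = CheckSchedule()
intervals = [schedule.record(ok) for ok in (True, True, True, False, True)]
print(intervals)  # [10.0, 20.0, 30.0, 1.0, 2.0]
```

Stable services drift toward the 30-second ceiling, which cuts monitoring load, while a single failure restores 1-second checks. The trade-off is that a failure on a long-stable service may take up to the ceiling interval to be noticed.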

Back-of-Envelope Cost Analysis

Assuming each health check (request plus response) transfers ~1 KB in total:

At 1,000 services checked every 5 seconds: 1,000 * (1/5) = 200 checks/sec.
Network bandwidth: 200 KB/sec (~1.6 Mbps) - manageable on typical servers.
CPU: Each check requires processing; 200 checks/sec is moderate load.
Storage: Logs and history can grow quickly; consider retention policies.

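At 100,000 services with 1-second checks, the load is 100,000 checks/sec. Under the same ~1 KB assumption, that is about 100 MB/sec (~800 Mbps), which is too much for a single monitoring server and requires distributed monitoring with efficient aggregation.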
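The arithmetic above can be reproduced with a small Python helper. The function name health_check_load is hypothetical, and the sketch treats 1 KB as 1,000 bytes, which is close enough for a back-of-envelope estimate.

```python
def health_check_load(services: int, interval_s: float, kb_per_check: float = 1.0):
    """Return (checks per second, KB per second, megabits per second)."""
    checks_per_sec = services / interval_s
    kb_per_sec = checks_per_sec * kb_per_check
    mbps = kb_per_sec * 8 / 1000  # assumes 1 KB = 1,000 bytes
    return checks_per_sec, kb_per_sec, mbps


print(health_check_load(1_000, 5))    # (200.0, 200.0, 1.6)
print(health_check_load(100_000, 1))  # (100000.0, 100000.0, 800.0)
```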

Interview Tip

When discussing health check scalability, start by defining the scale (number of services, check frequency). Identify the monitoring system as the first bottleneck. Propose solutions such as horizontal scaling, aggregation, and adaptive frequency. Discuss the trade-off between check freshness and system load, and highlight the cost and complexity implications.

Self Check Question

Your monitoring system handles 1,000 health check requests per second. Traffic grows 10x. What do you do first and why?

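Answer: First, scale horizontally by adding more monitoring servers and distributing the checks across them. At 10x traffic the system must handle about 10,000 checks per second, so spreading the load keeps any one server from being overloaded and keeps health status updates timely.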
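One way to distribute checks across monitoring servers is to hash each service name to a server. The sketch below is a minimal Python illustration under assumed names (assign_monitor, monitor-0, svc-0, and so on are hypothetical). It uses simple modulo hashing, not consistent hashing, so most assignments move when the number of monitors changes.

```python
import hashlib
from collections import Counter


def assign_monitor(service_name: str, monitors: list[str]) -> str:
    """Pick a monitoring server for a service using a stable hash."""
    # hashlib gives the same digest across processes, unlike the built-in hash().
    digest = hashlib.sha256(service_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(monitors)
    return monitors[index]


monitors = [f"monitor-{i}" for i in range(10)]   # hypothetical server names
services = [f"svc-{i}" for i in range(1_000)]    # hypothetical service names

# Count how many services land on each monitoring server.
load = Counter(assign_monitor(name, monitors) for name in services)
print(sorted(load.items()))
```

Each monitor then polls only its own share of services. A consistent hashing ring would be the usual refinement if monitors are added or removed often.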

Key Result

Health checks scale by increasing monitoring capacity and optimizing check frequency; the monitoring system is the first bottleneck as service count and check frequency grow.