Health checks in HLD - Scalability & System Analysis

Scalability Analysis - Health checks
Growth Table: Health Checks at Different Scales
| Users | Health Check Frequency | Number of Services | Monitoring Load | Alerting Complexity |
|---|---|---|---|---|
| 100 users | Every 30 seconds | 1-5 services | Low | Simple |
| 10,000 users | Every 15 seconds | 10-50 services | Moderate | Moderate |
| 1,000,000 users | Every 5 seconds | 100-500 services | High | Complex |
| 100,000,000 users | Every 1-2 seconds | 1000+ services | Very High | Very Complex |
First Bottleneck

The first bottleneck is the monitoring system's capacity to process health check requests and responses. As the number of services and the frequency of checks increase, the monitoring server's CPU, memory, and network bandwidth can become overwhelmed. Health status updates then arrive late or are missed entirely, so failures are detected later and overall system reliability drops.

Scaling Solutions
  • Horizontal Scaling: Deploy multiple monitoring servers to distribute health check load.
  • Health Check Aggregation: Use local agents or sidecars to aggregate health data before sending to central monitoring.
  • Adaptive Frequency: Reduce check frequency for stable services, increase for critical or unstable ones.
  • Caching and Throttling: Cache recent health results and throttle redundant checks to reduce load.
  • Asynchronous Checks: Use event-driven or push-based health reporting instead of polling.
  • Sharding Monitoring Data: Partition monitoring data by service groups or regions to reduce single point load.
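
Two of the ideas above, adaptive frequency and sharding, can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `next_interval` and `shard_for`, the 1-second floor, the 30-second ceiling, and the doubling back-off are all assumptions chosen for the example.

```python
import hashlib

# Assumed bounds for this sketch; real values depend on the system's SLOs.
MIN_INTERVAL_S = 1.0    # floor used for unstable or critical services
MAX_INTERVAL_S = 30.0   # ceiling used for long-stable services


def next_interval(current_s: float, healthy: bool) -> float:
    """Adaptive frequency: back off while healthy, snap to the floor on failure."""
    if not healthy:
        return MIN_INTERVAL_S
    # Double the interval after each healthy check, capped at the ceiling.
    return min(current_s * 2, MAX_INTERVAL_S)


def shard_for(service_name: str, num_monitors: int) -> int:
    """Sharding: map a service to one monitoring server using a stable hash."""
    digest = hashlib.sha256(service_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_monitors
```

Plain modulo sharding is easy to reason about, but it reassigns most services whenever `num_monitors` changes; consistent hashing is a common alternative when monitoring servers are added or removed frequently.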
Back-of-Envelope Cost Analysis

Assuming each health check (request plus response) transfers about 1 KB in total:

  • At 1,000 services checked every 5 seconds: 1,000 * (1/5) = 200 checks/sec.
  • Network bandwidth: 200 KB/sec (~1.6 Mbps) - manageable on typical servers.
  • CPU: Each check requires processing; 200 checks/sec is moderate load.
  • Storage: Logs and history can grow quickly; consider retention policies.

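At 100,000 services with 1-second checks, the load is 100,000 checks/sec, or roughly 100 MB/sec (~800 Mbps) under the same 1 KB assumption. A single monitoring server is unlikely to sustain this, so distributed monitoring and efficient aggregation become necessary.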
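
The arithmetic above can be reproduced with a small helper. This is a rough sketch: `check_load` is a hypothetical name, and it assumes 1 KB = 1,000 bytes and ~1 KB transferred per check, matching the estimate in this section.

```python
def check_load(services: int, interval_s: float, kb_per_check: float = 1.0):
    """Return (checks per second, bandwidth in Mbps) for a polling monitor."""
    checks_per_sec = services / interval_s
    # KB/sec -> kilobits/sec (x8) -> megabits/sec (/1000)
    mbps = checks_per_sec * kb_per_check * 8 / 1000
    return checks_per_sec, mbps


print(check_load(1_000, 5))     # (200.0, 1.6)
print(check_load(100_000, 1))   # (100000.0, 800.0)
```

Changing `interval_s` or `kb_per_check` shows how quickly load grows when checks become more frequent or responses carry more detail.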

Interview Tip

When discussing health check scalability, start by defining the scale (number of services and check frequency). Identify the monitoring system as the likely first bottleneck. Propose solutions such as horizontal scaling, aggregation, and adaptive frequency. Discuss the trade-off between check freshness and system load, and highlight the cost and complexity implications of each option.

Self Check Question

Your monitoring system handles 1,000 health check requests per second. Traffic grows 10x. What do you do first and why?

Answer: First, implement horizontal scaling by adding more monitoring servers to distribute the load. This prevents overload and maintains timely health status updates.
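
To put a number on "adding more monitoring servers", a quick sizing sketch follows. It assumes, purely for illustration, that the current 1,000 checks/sec is the capacity of a single server and that 30% of each server's capacity is kept as headroom; `monitors_needed` is a hypothetical helper, not part of any monitoring tool.

```python
import math


def monitors_needed(target_checks_per_sec: float,
                    per_server_capacity: float,
                    headroom: float = 0.3) -> int:
    """Estimate server count, keeping a fraction of each server's capacity spare."""
    usable = per_server_capacity * (1 - headroom)
    return math.ceil(target_checks_per_sec / usable)


# 10x growth from 1,000 to 10,000 checks/sec, under the assumptions above.
print(monitors_needed(10_000, 1_000))                # 15
print(monitors_needed(10_000, 1_000, headroom=0.0))  # 10
```

The gap between the two results is the cost of headroom: running servers at full capacity needs fewer machines but leaves no room for bursts or a failed monitor.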

Key Result
Health checks scale by increasing monitoring capacity and optimizing check frequency; the monitoring system is the first bottleneck as service count and check frequency grow.