
Health checks in HLD - System Design Exercise

Design: Health Check System for Distributed Services
In scope: Health check system architecture, data flow, and scaling.
Out of scope: Implementation details of individual service health checks and alerting system internals.
Functional Requirements
FR1: Periodically verify the status of multiple services in a distributed system
FR2: Detect if a service is up, down, or degraded
FR3: Provide a dashboard or API to show current health status of all services
FR4: Send alerts when a service becomes unhealthy
FR5: Support different types of health checks: simple ping, HTTP status, and custom checks
FR6: Allow configuration of check frequency and timeout per service
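FR2 and FR6 together imply a rule that turns one raw check outcome into up, down, or degraded. The sketch below shows one possible rule, not a prescribed one. The `CheckConfig` fields and the `degraded_after_s` latency threshold are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    UP = "up"
    DOWN = "down"
    DEGRADED = "degraded"


@dataclass(frozen=True)
class CheckConfig:
    check_type: str          # "ping", "http", or "custom" (FR5)
    interval_s: float        # check frequency (FR6)
    timeout_s: float         # per-check timeout (FR6)
    degraded_after_s: float  # hypothetical latency threshold for "degraded"


def classify(config: CheckConfig, succeeded: bool, elapsed_s: Optional[float]) -> Status:
    """Map one raw check outcome to up / down / degraded (FR2)."""
    # A failed check, no response, or a response slower than the timeout is "down".
    if not succeeded or elapsed_s is None or elapsed_s > config.timeout_s:
        return Status.DOWN
    # A successful but slow response is "degraded".
    if elapsed_s > config.degraded_after_s:
        return Status.DEGRADED
    return Status.UP
```

With `CheckConfig("http", 10.0, 1.0, 0.3)`, a successful response in 0.5 seconds would be classified as degraded, and one that takes 1.5 seconds as down.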
Non-Functional Requirements
NFR1: Must handle monitoring at least 1000 services concurrently
NFR2: Health check latency should be under 1 second per check
NFR3: System availability target: 99.9% uptime
NFR4: Minimal impact on monitored services (lightweight checks)
NFR5: Scalable to add more services without major redesign
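A quick back-of-the-envelope estimate helps turn these targets into sizing numbers. The sketch below assumes a 10-second check interval, which the requirements do not state, and treats the 1-second latency target as the worst-case duration of a check.

```python
import math

SERVICES = 1000            # NFR1
MAX_CHECK_LATENCY_S = 1.0  # NFR2, used as the worst-case check duration
AVAILABILITY = 0.999       # NFR3
CHECK_INTERVAL_S = 10.0    # assumption: not stated in the requirements

checks_per_second = SERVICES / CHECK_INTERVAL_S
# Little's law: checks in flight = arrival rate * time each check takes.
workers_needed = math.ceil(checks_per_second * MAX_CHECK_LATENCY_S)
results_per_day = checks_per_second * 86_400
downtime_min_per_30_days = (1 - AVAILABILITY) * 30 * 24 * 60

print(checks_per_second)                   # 100.0
print(workers_needed)                      # 100
print(results_per_day)                     # 8640000.0
print(round(downtime_min_per_30_days, 1))  # 43.2
```

Under these assumptions the system runs about 100 checks per second, needs roughly 100 concurrent workers in the worst case, stores about 8.6 million results per day, and may be unavailable for about 43 minutes in a 30-day month.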
Think Before You Design
Key Components
Health check scheduler to trigger checks at configured intervals
Health check executor to perform actual checks
Data store to save health status and history
API or dashboard to display health information
Alerting mechanism for unhealthy services
Load balancer or queue to distribute health check tasks
Design Patterns
Polling pattern for periodic checks
Circuit breaker pattern to avoid repeated checks on failing services
Caching to reduce load and improve response times
Bulkhead pattern to isolate failures
Event-driven architecture for alerting
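As an illustration of the circuit breaker pattern applied to health checks, here is a minimal sketch. The class name, thresholds, and injectable clock are assumptions made for this example. While the breaker is open, the service would simply keep its last reported status (down) until a probe succeeds.

```python
import time
from typing import Callable, Optional


class CheckBreaker:
    """Pauses checks for a service after repeated failures, then probes after a cooldown."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a check may run now."""
        if self.opened_at is None:
            return True  # closed: check normally
        # Open: allow a probe only once the cooldown has elapsed (half-open).
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        """Record the outcome of a check that was allowed to run."""
        if ok:
            self.failures = 0
            self.opened_at = None  # close the breaker again
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or restart the cooldown
```

The scheduler or executor would call `allow()` before running a check and `record()` afterwards. The trade-off is slower recovery detection: a service that comes back during the cooldown is not noticed until the next probe.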
Reference Architecture
                +---------------------+
                |   Health Check UI    |
                +----------+----------+
                           |
                           v
                +---------------------+
                |  Health Check API    |
                +----------+----------+
                           |
           +---------------+----------------+
           |                                |
+----------v----------+           +---------v----------+
| Health Check        |           | Alerting Service   |
| Scheduler           |           +--------------------+
+----------+----------+
           |
           v
+---------------------+
| Health Check        |
| Executor Pool       |
+----------+----------+
           |
           v
+---------------------+
| Monitored Services  |
+---------------------+

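Data Store (e.g., a time-series DB), not drawn above: the Scheduler reads service check configurations from it, the Executor Pool writes check results to it, and the API reads health status from it.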
Components
Health Check Scheduler
Custom service or cron-based scheduler
Triggers health checks at configured intervals for each service
Health Check Executor Pool
Thread pool or worker queue system
Performs actual health checks concurrently without overloading services
Health Check API
REST API server
Provides access to current and historical health data for UI and external clients
Health Check UI
Web dashboard
Displays health status of all monitored services in real-time
Alerting Service
Notification system (email, SMS, webhook)
Sends alerts when services become unhealthy or recover
Data Store
Time-series database (e.g., Prometheus, InfluxDB)
Stores health check results and history for analysis and display
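To make the Scheduler component more concrete, the following sketch keeps a min-heap of next-due times so that finding due checks stays cheap as the service count grows. It is one possible approach, not the only one. `IntervalScheduler` and its methods are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```python
import heapq
from typing import Dict, List, Tuple


class IntervalScheduler:
    """Tracks when each service is next due for a check."""

    def __init__(self, intervals_s: Dict[str, float], start: float = 0.0) -> None:
        self.intervals_s = intervals_s
        # Min-heap of (next_due_time, service_name); every service is due at `start`.
        self.heap: List[Tuple[float, str]] = [(start, name) for name in sorted(intervals_s)]
        heapq.heapify(self.heap)

    def due(self, now: float) -> List[str]:
        """Pop every service whose check is due at `now` and reschedule it."""
        ready: List[str] = []
        while self.heap and self.heap[0][0] <= now:
            due_at, name = heapq.heappop(self.heap)
            ready.append(name)
            interval = self.intervals_s[name]
            # Schedule from the planned time to avoid drift, unless we fell behind.
            next_due = due_at + interval
            if next_due <= now:
                next_due = now + interval
            heapq.heappush(self.heap, (next_due, name))
        return ready
```

For example, with intervals `{"api": 5.0, "db": 10.0}`, calling `due(0.0)` returns both services, `due(4.0)` returns nothing, and `due(5.0)` returns only `"api"`. The returned names would then be handed to the Executor Pool, typically through a queue.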
Request Flow
1. Scheduler reads service list and check configurations.
2. Scheduler triggers health check tasks at configured intervals.
3. Executor pool performs health checks (ping, HTTP, custom) concurrently.
4. Executor reports results to Data Store.
5. API reads health data from Data Store to serve UI and external queries.
6. Alerting Service subscribes to health status changes and sends notifications.
7. UI displays real-time health status and history to users.
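The core of this flow (run checks concurrently, store results, alert on status changes) can be sketched in a single process. This is a simplified illustration: the checks are plain callables rather than real network probes, the "data store" is an in-memory list, and `run_round` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Check = Callable[[], bool]  # hypothetical: returns True when the service is healthy


def run_round(checks: Dict[str, Check], last_status: Dict[str, str],
              history: List[Tuple[str, str]], max_workers: int = 8) -> List[str]:
    """Run one round of checks and return alert messages for status changes."""

    def safe(check: Check) -> str:
        try:
            return "up" if check() else "down"
        except Exception:
            return "down"  # a check that raises is treated as a failed check

    # Step 3: the executor pool runs the checks concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = dict(zip(checks, pool.map(safe, checks.values())))

    alerts: List[str] = []
    for name, status in statuses.items():
        history.append((name, status))  # step 4: persist the result
        previous = last_status.get(name)
        # Step 6: alert only on a change, not on every failing result.
        if previous is not None and previous != status:
            alerts.append(f"{name}: {previous} -> {status}")
        last_status[name] = status
    return alerts
```

In this sketch, the first result for a service is recorded as a baseline and does not raise an alert. Alerting only on transitions keeps a service that stays down from producing a notification on every check.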
Database Schema
Entities:
- Service: id, name, type, check_config (frequency, timeout, check_type)
- HealthCheckResult: id, service_id (FK), timestamp, status (up/down/degraded), response_time, details
Relationships:
- One Service has many HealthCheckResults
- HealthCheckResult stores each check's outcome for a service
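The entities above can be tried out quickly with an in-memory SQLite database. SQLite is only a stand-in here for the time-series store suggested by the design. The column names for the flattened `check_config` fields, the units, and the sample rows are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE service (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    type        TEXT NOT NULL,
    check_type  TEXT NOT NULL CHECK (check_type IN ('ping', 'http', 'custom')),
    frequency_s INTEGER NOT NULL,
    timeout_s   INTEGER NOT NULL
);
CREATE TABLE health_check_result (
    id               INTEGER PRIMARY KEY,
    service_id       INTEGER NOT NULL REFERENCES service(id),
    timestamp        TEXT NOT NULL,
    status           TEXT NOT NULL CHECK (status IN ('up', 'down', 'degraded')),
    response_time_ms REAL,
    details          TEXT
);
CREATE INDEX idx_result_service_time ON health_check_result (service_id, timestamp);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO service VALUES (1, 'orders', 'http-api', 'http', 10, 1)")
conn.executemany(
    "INSERT INTO health_check_result "
    "(service_id, timestamp, status, response_time_ms, details) VALUES (?, ?, ?, ?, ?)",
    [(1, "2024-01-01T00:00:00Z", "up", 42.0, None),
     (1, "2024-01-01T00:00:10Z", "degraded", 830.0, "slow response")],
)

# Latest status per service, as the API might serve it to the dashboard.
latest = conn.execute(
    """SELECT s.name, r.status FROM service s
       JOIN health_check_result r ON r.service_id = s.id
       WHERE r.timestamp = (SELECT MAX(timestamp) FROM health_check_result
                            WHERE service_id = s.id)"""
).fetchall()
print(latest)  # [('orders', 'degraded')]
```

The index on (service_id, timestamp) supports the two most common reads: the latest status for a service and its history over a time range.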
Scaling Discussion
Bottlenecks
Scheduler overwhelmed when many services must be checked at short intervals
Executor pool saturates, delaying health checks
Data Store write and read bottlenecks under a high volume of health data
Alerting system overwhelmed by frequent status changes
UI/API latency increases with large data volume
Solutions
Partition services and distribute scheduling across multiple scheduler instances
Scale executor pool horizontally with worker queues and rate limiting per service
Use scalable time-series databases with sharding and retention policies
Implement alert deduplication and rate limiting to reduce noise
Cache recent health data and paginate UI/API responses for performance
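The first solution, partitioning services across scheduler instances, can be sketched with a stable hash so that every instance independently agrees on which services it owns. The function names are hypothetical, and the modulo scheme is a simplification chosen for brevity.

```python
import hashlib
from typing import Dict, List


def owner(service_name: str, scheduler_count: int) -> int:
    """Map a service to a scheduler index with a hash that is stable across processes."""
    digest = hashlib.sha256(service_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % scheduler_count


def partition(services: List[str], scheduler_count: int) -> Dict[int, List[str]]:
    """Group services by the scheduler instance that owns them."""
    shards: Dict[int, List[str]] = {i: [] for i in range(scheduler_count)}
    for name in services:
        shards[owner(name, scheduler_count)].append(name)
    return shards
```

Each scheduler instance would only schedule the services in its own shard. One known drawback of plain modulo hashing is that changing `scheduler_count` reassigns most services. Consistent hashing or a coordination service would reduce that churn at the cost of more complexity.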
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing architecture and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Explain importance of lightweight, periodic health checks to avoid service overload
Discuss different types of health checks and their trade-offs
Describe how scheduler and executor components work together
Highlight data storage choice for time-series health data
Address alerting strategy to avoid alert fatigue
Discuss scaling challenges and solutions clearly
Show awareness of availability and latency targets