Health checks in HLD - System Design Exercise

Design: Health Check System for Distributed Services

In scope: Designing the health check system architecture, data flow, and scaling.

Out of scope: Implementation details of individual service health checks or alerting system internals.

Functional Requirements

FR1: Periodically verify the status of multiple services in a distributed system

FR2: Detect if a service is up, down, or degraded

FR3: Provide a dashboard or API to show current health status of all services

FR4: Send alerts when a service becomes unhealthy

FR5: Support different types of health checks: simple ping, HTTP status, and custom checks

FR6: Allow configuration of check frequency and timeout per service
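FR2, FR5, and FR6 together imply a small per-service configuration and a rule for turning one check outcome into up, degraded, or down. The following is a minimal sketch under stated assumptions: the field names, the default values, and the idea of a latency threshold for "degraded" are illustrative and do not come from the requirements themselves.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


@dataclass(frozen=True)
class CheckConfig:
    """Per-service settings (FR5, FR6). All names and defaults are hypothetical."""
    check_type: str = "http"        # one of "ping", "http", "custom"
    interval_s: float = 30.0        # how often the check runs
    timeout_s: float = 1.0          # a check slower than this counts as failed
    degraded_after_s: float = 0.5   # assumed latency threshold for "degraded"


def classify(succeeded: bool, response_time_s: float, cfg: CheckConfig) -> Status:
    """Map one check outcome to up / degraded / down (FR2)."""
    if not succeeded or response_time_s >= cfg.timeout_s:
        return Status.DOWN
    if response_time_s >= cfg.degraded_after_s:
        return Status.DEGRADED
    return Status.UP
```

A real system might also require several consecutive failures before reporting "down", which would reduce flapping at the cost of slower detection.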

Non-Functional Requirements

NFR1: Must monitor at least 1,000 services concurrently

NFR2: Health check latency should be under 1 second per check

NFR3: System availability target: 99.9% uptime

NFR4: Minimal impact on monitored services (lightweight checks)

NFR5: Scalable to add more services without major redesign
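NFR1 and NFR2 can be turned into rough capacity numbers. The sketch below assumes a 10-second check interval for every service, which the requirements do not specify, and uses the 1-second latency bound from NFR2 as the worst-case check duration. The worker estimate follows Little's law (concurrency = arrival rate x time in system).

```python
import math


def required_workers(num_services: int, interval_s: int, worst_case_check_s: float) -> int:
    """Concurrent workers needed so checks never queue behind slow checks."""
    checks_per_second = num_services / interval_s
    return math.ceil(checks_per_second * worst_case_check_s)


def results_per_day(num_services: int, interval_s: int) -> int:
    """Number of result rows written to the data store per day."""
    return num_services * 86_400 // interval_s


# Assumed workload: 1,000 services (NFR1), one check every 10 s, 1 s worst case (NFR2).
workers = required_workers(1000, 10, 1.0)   # 100 checks/s * 1 s = 100 workers
rows = results_per_day(1000, 10)            # 8,640,000 rows per day
```

Under these assumptions the executor pool needs on the order of 100 concurrent workers, and the data store absorbs several million writes per day, which is why retention policies matter later in the scaling discussion.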

Think Before You Design

Key Components

Health check scheduler to trigger checks at configured intervals

Health check executor to perform actual checks

Data store to save health status and history

API or dashboard to display health information

Alerting mechanism for unhealthy services

Load balancer or queue to distribute health check tasks
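The scheduler component listed above can be sketched as a priority queue keyed by each service's next due time. This is an illustrative in-memory version, not a prescribed implementation: the class name `CheckScheduler`, its methods, and the injected `now` clock value are all assumptions made for the example.

```python
import heapq


class CheckScheduler:
    """Tracks when each service is next due for a health check."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []   # (next_due_time, service_id)
        self._intervals: dict[str, float] = {}

    def add_service(self, service_id: str, interval_s: float, now: float = 0.0) -> None:
        """Register a service; its first check is due immediately."""
        self._intervals[service_id] = interval_s
        heapq.heappush(self._heap, (now, service_id))

    def due(self, now: float) -> list[str]:
        """Return the services whose check is due and reschedule each one."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            _, service_id = heapq.heappop(self._heap)
            ready.append(service_id)
            # Rescheduling relative to `now` allows slight drift but guarantees
            # a service is returned at most once per call.
            next_due = now + self._intervals[service_id]
            heapq.heappush(self._heap, (next_due, service_id))
        return ready
```

A caller would invoke `due(time.monotonic())` in a loop and hand the returned service IDs to the queue or executor. Passing the clock value in, rather than reading it inside the class, keeps the scheduler easy to test.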

Design Patterns

Polling pattern for periodic checks

Circuit breaker pattern to avoid repeated checks on failing services

Caching to reduce load and improve response times

Bulkhead pattern to isolate failures

Event-driven architecture for alerting
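As one illustration of the patterns above, a circuit breaker for health checks can pause checks against a service that keeps failing and then let a single probe through after a cooldown. The sketch below is a simplified variant with hypothetical names and thresholds (`failure_threshold`, `cooldown_s`). It is not taken from any particular library.

```python
from typing import Optional


class CheckCircuitBreaker:
    """Skips checks for a service after repeated failures, then probes again."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 60.0) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None   # None means the breaker is closed

    def allow_check(self, now: float) -> bool:
        """Closed: always allow. Open: allow only once the cooldown has elapsed."""
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.cooldown_s

    def record(self, succeeded: bool, now: float) -> None:
        """Update breaker state with the outcome of a check."""
        if succeeded:
            self.consecutive_failures = 0
            self.opened_at = None                # close the breaker
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = now                 # open, or restart the cooldown
```

The trade-off is detection speed: while the breaker is open, a recovery is only noticed at the next probe, so the cooldown should stay short relative to how quickly recoveries need to be reported.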

Reference Architecture

                +---------------------+
                |   Health Check UI   |
                +----------+----------+
                           |
                           v
                +---------------------+
                |  Health Check API   |
                +----------+----------+
                           |
           +---------------+----------------+
           |                                |
+----------v----------+           +---------v----------+
| Health Check        |           | Alerting Service   |
| Scheduler           |           +--------------------+
+----------+----------+
           |
           v
+---------------------+
| Health Check        |
| Executor Pool       |
+----------+----------+
           |
           v
+---------------------+
| Monitored Services  |
+---------------------+

Data Store (e.g., time-series DB), not shown in the diagram, sits between the checking side (Scheduler and Executor Pool), which writes health status, and the API, which reads it.

Components

Health Check Scheduler

Custom service or cron-based scheduler

Triggers health checks at configured intervals for each service

Health Check Executor Pool

Thread pool or worker queue system

Performs actual health checks concurrently without overloading services

Health Check API

REST API server

Provides access to current and historical health data for UI and external clients

Health Check UI

Web dashboard

Displays health status of all monitored services in real-time

Alerting Service

Notification system (email, SMS, webhook)

Sends alerts when services become unhealthy or recover

Data Store

Time-series database (e.g., Prometheus, InfluxDB)

Stores health check results and history for analysis and display
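The Health Check Executor Pool described above can be approximated with a bounded thread pool from the Python standard library. In this sketch each check is an arbitrary callable returning a truthy value on success. The function names and the `max_workers` default are assumptions for illustration, and a real ping or HTTP check would need to enforce its own network timeout because threads cannot be cancelled from outside.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


def run_check(check: Callable[[], object], timeout_s: float) -> Tuple[bool, float]:
    """Run one check; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        ok = bool(check())
    except Exception:
        ok = False                      # any exception counts as a failed check
    elapsed = time.monotonic() - start
    # The timeout is applied after the fact: a slow check is marked failed,
    # but it is not interrupted while running.
    return (ok and elapsed <= timeout_s, elapsed)


def run_batch(
    checks: Dict[str, Callable[[], object]],
    timeout_s: float = 1.0,
    max_workers: int = 16,
) -> Dict[str, Tuple[bool, float]]:
    """Run all checks concurrently with at most `max_workers` in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {sid: pool.submit(run_check, fn, timeout_s) for sid, fn in checks.items()}
        return {sid: fut.result() for sid, fut in futures.items()}
```

Capping `max_workers` is what keeps the pool from overloading either the monitor host or the monitored services. A worker-queue deployment would apply the same cap per worker process.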

Request Flow

1. Scheduler reads service list and check configurations.

2. Scheduler triggers health check tasks at configured intervals.

3. Executor pool performs health checks (ping, HTTP, custom) concurrently.

4. Executor reports results to Data Store.

5. API reads health data from Data Store to serve UI and external queries.

6. Alerting Service subscribes to health status changes and sends notifications.

7. UI displays real-time health status and history to users.
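Step 6 of the flow depends on detecting status changes rather than reacting to every result. A minimal sketch of that detection is shown below. The class name, the event dictionary layout, and the choice to treat a never-seen service as "up" are assumptions for the example.

```python
from typing import Dict, Optional


class TransitionDetector:
    """Emits an event only when a service's status changes."""

    def __init__(self) -> None:
        self._last: Dict[str, str] = {}

    def observe(self, service_id: str, status: str) -> Optional[dict]:
        """Record the latest status; return an alert event on a change, else None."""
        # Assumption: a service with no history is treated as previously "up",
        # so a first result of "down" or "degraded" still raises an alert.
        previous = self._last.get(service_id, "up")
        self._last[service_id] = status
        if previous == status:
            return None
        kind = "recovered" if status == "up" else "unhealthy"
        return {"service_id": service_id, "from": previous, "to": status, "kind": kind}
```

Feeding every stored result through `observe` and forwarding only the non-None events gives the Alerting Service one notification per change, covering both failures and recoveries.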

Database Schema

Entities:
- Service: id, name, type, check_config (frequency, timeout, check_type)
- HealthCheckResult: id, service_id (FK), timestamp, status (up/down/degraded), response_time, details

Relationships:
- One Service has many HealthCheckResults
- HealthCheckResult stores each check's outcome for a service
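The two entities above can be written out as tables. The sketch below uses SQLite from the Python standard library purely as a stand-in so the example runs anywhere. The design itself suggests a time-series database, and the column types, units (`frequency_s`, `timeout_ms`, `response_time_ms`), and index are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE service (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    type        TEXT NOT NULL,
    check_type  TEXT NOT NULL CHECK (check_type IN ('ping', 'http', 'custom')),
    frequency_s INTEGER NOT NULL,
    timeout_ms  INTEGER NOT NULL
);
CREATE TABLE health_check_result (
    id               INTEGER PRIMARY KEY,
    service_id       INTEGER NOT NULL REFERENCES service(id),
    timestamp        TEXT NOT NULL,   -- ISO 8601 UTC, sorts chronologically
    status           TEXT NOT NULL CHECK (status IN ('up', 'down', 'degraded')),
    response_time_ms REAL,
    details          TEXT
);
CREATE INDEX idx_result_service_time
    ON health_check_result (service_id, timestamp);
"""


def latest_status(conn: sqlite3.Connection, service_id: int):
    """Return the most recent status for a service, or None if it has no results."""
    row = conn.execute(
        "SELECT status FROM health_check_result "
        "WHERE service_id = ? ORDER BY timestamp DESC LIMIT 1",
        (service_id,),
    ).fetchone()
    return row[0] if row else None
```

The `(service_id, timestamp)` index reflects the dominant query in this design: the dashboard and API asking for the newest results of a given service.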

Scaling Discussion

Bottlenecks

Scheduler overwhelmed when a large number of services must be checked frequently

Executor pool saturates causing delays in health checks

Data Store write/read bottlenecks with high volume of health data

Alerting system overwhelmed by frequent status changes

UI/API latency increases with large data volume

Solutions

Partition services and distribute scheduling across multiple scheduler instances

Scale executor pool horizontally with worker queues and rate limiting per service

Use scalable time-series databases with sharding and retention policies

Implement alert deduplication and rate limiting to reduce noise

Cache recent health data and paginate UI/API responses for performance
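The first solution, partitioning services across scheduler instances, can be sketched with a stable hash so that every instance independently agrees on which services it owns. The function names are hypothetical. The sketch uses simple modulo hashing, which reassigns most services when the instance count changes, and consistent hashing is a common way to reduce that churn.

```python
import hashlib
from typing import Iterable, List


def owner_index(service_id: str, num_schedulers: int) -> int:
    """Map a service to a scheduler instance in the range [0, num_schedulers)."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and would give different answers per instance.
    digest = hashlib.sha256(service_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_schedulers


def services_for(instance: int, num_schedulers: int, service_ids: Iterable[str]) -> List[str]:
    """Return the subset of services that a given scheduler instance should own."""
    return [s for s in service_ids if owner_index(s, num_schedulers) == instance]
```

Each scheduler instance would call `services_for` with its own index and schedule only that subset, so no coordination is needed beyond agreeing on the instance count.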

Interview Tips

Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing architecture and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.

Explain the importance of lightweight, periodic health checks to avoid service overload

Discuss different types of health checks and their trade-offs

Describe how scheduler and executor components work together

Highlight data storage choice for time-series health data

Address alerting strategy to avoid alert fatigue

Discuss scaling challenges and solutions clearly

Show awareness of availability and latency targets