Health checks in containers in Microservices - System Design Exercise

Design: Container Health Check System

This design focuses on health check mechanisms within container orchestration environments. Container orchestration internals and detailed alerting system design are out of scope.

Functional Requirements

FR1: Containers must report their health status regularly.

FR2: Health checks should detect if a container is alive and ready to serve traffic.

FR3: The system should support both liveness and readiness probes.

FR4: Health check failures should trigger container restarts or traffic rerouting.

FR5: Health check results must be accessible for monitoring and alerting.
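As a rough illustration of FR2 and FR3, the following sketch shows one way a containerized service might expose separate liveness and readiness endpoints using only the Python standard library. The paths `/healthz` and `/readyz`, port 8080, and the names `HealthState` and `probe_response` are assumptions made for this example, not part of the exercise.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthState:
    """Thread-safe flags the application flips as its condition changes."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.alive = True    # False: the process is stuck and should be restarted
        self.ready = False   # False: do not route traffic to this container yet

    def set_ready(self, ready: bool) -> None:
        with self._lock:
            self.ready = ready

    def snapshot(self) -> tuple[bool, bool]:
        with self._lock:
            return self.alive, self.ready


def probe_response(path: str, state: HealthState) -> tuple[int, dict]:
    """Map a probe path to an HTTP status code and a JSON-serializable body."""
    alive, ready = state.snapshot()
    if path == "/healthz":   # liveness probe (assumed path)
        return (200 if alive else 500), {"probe": "liveness", "ok": alive}
    if path == "/readyz":    # readiness probe (assumed path)
        ok = alive and ready
        return (200 if ok else 503), {"probe": "readiness", "ok": ok}
    return 404, {"error": "unknown probe path"}


STATE = HealthState()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        code, body = probe_response(self.path, STATE)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args) -> None:
        pass  # keep probe traffic out of the application log


if __name__ == "__main__":
    STATE.set_ready(True)  # a real service would set this after warm-up
    HTTPServer(("0.0.0.0", 8080), ProbeHandler).serve_forever()
```

Keeping the two flags separate lets a container stay alive (no restart) while reporting that it is not yet ready for traffic, for example during startup.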

Non-Functional Requirements

NFR1: Health checks must run with minimal performance impact on containers.

NFR2: Each health check should complete in under 1 second.

NFR3: System must support at least 10,000 containers concurrently.

NFR4: Availability target is 99.9% uptime for health check monitoring.

NFR5: Health check configuration must be flexible per container type.

Think Before You Design

Key Components

Health check probes inside containers

Container runtime or orchestrator integration

Health check controller or manager

Monitoring and alerting system

Configuration management for health checks

Design Patterns

Circuit breaker pattern for unhealthy containers

Retry and backoff strategies for transient failures

Sidecar pattern for health monitoring

Push vs pull health check models
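Two of the patterns above can be sketched briefly. The tracker below is a simplified, circuit-breaker-style illustration: it marks a container unhealthy only after several consecutive failed probes, so a single transient failure does not trigger action. The helper next to it produces a capped exponential backoff schedule. The names `ProbeTracker` and `backoff_delays` and the default thresholds are assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    failure_threshold: int = 3   # consecutive failures before marking unhealthy
    success_threshold: int = 1   # consecutive successes before marking healthy again
    healthy: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, passed: bool) -> bool:
        """Record one probe result and return the current healthy flag."""
        if passed:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


def backoff_delays(base_seconds: float, factor: float,
                   cap_seconds: float, attempts: int) -> list[float]:
    """Capped exponential backoff schedule (no jitter, for clarity)."""
    return [min(cap_seconds, base_seconds * factor ** i) for i in range(attempts)]
```

With the default thresholds, two failed probes leave the container marked healthy and the third flips it to unhealthy; one later success restores it. A production system would usually add random jitter to the backoff delays so that many containers do not retry in lockstep.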

Reference Architecture

  +-------------------+       +----------------------+       +-----------------------+
  |                   |       |                      |       |                       |
  |   Container A     |<----->| Health Check Manager |<----->| Monitoring & Alerting |
  | (with probes)     |       | (Controller Service) |       | System                |
  |                   |       |                      |       |                       |
  +-------------------+       +----------------------+       +-----------------------+
           ^                             ^                               ^
           |                             |                               |
  +-------------------+       +----------------------+       +-----------------------+
  |                   |       |                      |       |                       |
  |   Container B     |<----->| Container Runtime /  |       | Configuration Store   |
  | (with probes)     |       | Orchestrator         |       | (Health check specs)  |
  |                   |       |                      |       |                       |
  +-------------------+       +----------------------+       +-----------------------+

Components

Container with Health Probes

Docker/Kubernetes

Runs liveness and readiness probes inside containers to report health status.

Health Check Manager

Custom microservice or Kubernetes controller

Manages health check scheduling, collects results, and triggers actions on failures.

Container Runtime / Orchestrator

Kubernetes, Docker Swarm, or similar

Executes health checks and restarts or isolates unhealthy containers.

Monitoring & Alerting System

Prometheus, Grafana, Alertmanager

Aggregates health check data, visualizes status, and sends alerts on failures.

Configuration Store

ConfigMaps, etcd, or similar

Stores health check configurations per container or service.

Request Flow

1. Container runs liveness and readiness probes at configured intervals.

2. Probe results are reported to the Container Runtime or directly to the Health Check Manager.

3. Health Check Manager aggregates results and evaluates container health.

4. If a container fails its liveness probe, the orchestrator restarts the container.

5. If a container fails its readiness probe, traffic routing to it is stopped.

6. Health Check Manager sends health status metrics to the Monitoring & Alerting System.

7. Monitoring system visualizes health and triggers alerts if thresholds are breached.

8. Configuration Store provides health check parameters to containers and the orchestrator.
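The decision made in steps 4 and 5 of the flow can be sketched as a small function. This is an illustrative simplification, assuming the liveness result takes precedence when both probes fail; the names `Action` and `decide_action` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                  # container is healthy, nothing to do
    RESTART = "restart"            # liveness failed: restart the container
    STOP_ROUTING = "stop_routing"  # readiness failed: remove from traffic routing


def decide_action(liveness_ok: bool, readiness_ok: bool) -> Action:
    """Choose the recovery action for one container from its latest probe results."""
    if not liveness_ok:
        # A restart supersedes rerouting; the container is assumed to be stuck.
        return Action.RESTART
    if not readiness_ok:
        return Action.STOP_ROUTING
    return Action.NONE


if __name__ == "__main__":
    for live, ready in [(True, True), (True, False), (False, True)]:
        print(live, ready, decide_action(live, ready).value)
```

Checking liveness first reflects the idea that a container needing a restart should not also be treated as merely "not ready"; once restarted, it re-enters the readiness check on its own.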

Database Schema

Entities:

- Container: id (PK), name, image, status

- HealthCheckConfig: id (PK), container_id (FK), type (liveness/readiness), interval_seconds, timeout_seconds, protocol (HTTP/TCP/Command), endpoint

- HealthCheckResult: id (PK), container_id (FK), timestamp, status (pass/fail), response_time_ms

Relationships:

- One Container has many HealthCheckConfigs

- One Container has many HealthCheckResults
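For concreteness, the entities above could be expressed as relational tables. The sketch below uses SQLite through Python's standard `sqlite3` module purely for illustration; the snake_case table names, column types, CHECK constraints, index, and the `open_store` helper are assumptions, and a real deployment would more likely keep results in a time-series store.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS container (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    image  TEXT NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS health_check_config (
    id               INTEGER PRIMARY KEY,
    container_id     INTEGER NOT NULL REFERENCES container(id),
    type             TEXT NOT NULL CHECK (type IN ('liveness', 'readiness')),
    interval_seconds INTEGER NOT NULL,
    timeout_seconds  INTEGER NOT NULL,
    protocol         TEXT NOT NULL CHECK (protocol IN ('HTTP', 'TCP', 'Command')),
    endpoint         TEXT
);
CREATE TABLE IF NOT EXISTS health_check_result (
    id               INTEGER PRIMARY KEY,
    container_id     INTEGER NOT NULL REFERENCES container(id),
    timestamp        TEXT NOT NULL,
    status           TEXT NOT NULL CHECK (status IN ('pass', 'fail')),
    response_time_ms INTEGER
);
-- Assumed index: most queries ask for one container's recent results.
CREATE INDEX IF NOT EXISTS idx_result_container_time
    ON health_check_result (container_id, timestamp);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database, enable foreign keys, and create the tables if missing."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

The one-to-many relationships are carried by the `container_id` foreign keys on the config and result tables.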

Scaling Discussion

Bottlenecks

Health Check Manager overwhelmed by a large number of containers sending frequent health data.

Monitoring system storage and query performance degrade with high volume of health metrics.

Orchestrator delays in restarting or isolating unhealthy containers under heavy load.

Network overhead from frequent health check probes affecting container performance.
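A quick back-of-the-envelope estimate shows why these bottlenecks appear. The sketch below takes the 10,000 containers from NFR3 and assumes two probes per container (liveness and readiness) at a 10-second interval; the interval is an assumption for illustration, not a figure from the requirements.

```python
def probes_per_second(containers: int, probes_per_container: int,
                      interval_seconds: float) -> float:
    """Average number of probe results produced per second across the fleet."""
    return containers * probes_per_container / interval_seconds


def results_per_day(rate_per_second: float) -> float:
    """Number of result records generated in 24 hours at a steady rate."""
    return rate_per_second * 24 * 60 * 60


if __name__ == "__main__":
    rate = probes_per_second(containers=10_000, probes_per_container=2,
                             interval_seconds=10)
    print(f"{rate:.0f} results/s, {results_per_day(rate):,.0f} results/day")
```

Under these assumptions the fleet produces about 2,000 results per second, or roughly 172.8 million records per day, which is why both the manager's ingest path and the metrics store need attention.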

Solutions

Shard Health Check Manager by container groups or namespaces to distribute load.

Use time-series databases optimized for metrics (e.g., Prometheus) with retention policies.

Implement rate limiting and backoff for health checks to reduce network overhead.

Use asynchronous event-driven communication between components to improve responsiveness.

Scale orchestrator control plane horizontally and optimize restart policies.
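The sharding idea in the first solution can be sketched as a stable hash from namespace to shard. This is a minimal illustration under the assumption of a fixed shard count; the function name `shard_for` is hypothetical, and `hashlib` is used instead of the built-in `hash()` because the latter is randomized per process for strings.

```python
import hashlib


def shard_for(namespace: str, shard_count: int) -> int:
    """Return a stable shard index in [0, shard_count) for a namespace."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(namespace.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    for ns in ["payments", "search", "checkout"]:
        print(ns, "->", shard_for(ns, shard_count=4))
```

Simple modulo hashing reassigns most namespaces whenever the shard count changes; consistent hashing is a common alternative when shards are added or removed frequently.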

Interview Tips

Time (45 minutes total): Spend 10 minutes understanding requirements and clarifying health check types. Use 20 minutes to design components and data flow. Reserve 10 minutes to discuss scaling and trade-offs. Use the last 5 minutes for questions and a summary.

Explain the difference between liveness and readiness probes and why both are needed.

Describe how health checks integrate with container orchestration for automated recovery.

Discuss trade-offs in probe frequency and impact on performance.

Highlight the importance of monitoring and alerting for operational visibility.

Address scaling challenges and practical solutions for large container fleets.

Practice

1. What is the main purpose of health checks in containers?

easy

A. To log all container network traffic

B. To increase the container's memory allocation

C. To update the container's software automatically

D. To verify if the container is running and responsive
