
Health checks in containers in Microservices - System Design Exercise

Design: Container Health Check System
This design focuses on health check mechanisms inside container orchestration environments. Container orchestration internals and detailed alerting system design are out of scope.
Functional Requirements
FR1: Containers must report their health status regularly.
FR2: Health checks should detect if a container is alive and ready to serve traffic.
FR3: The system should support both liveness and readiness probes.
FR4: Health check failures should trigger container restarts or traffic rerouting.
FR5: Health check results must be accessible for monitoring and alerting.
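FR2 and FR3 imply that each service exposes two separate signals. A minimal sketch of that split, assuming hypothetical endpoint paths `/healthz` (liveness) and `/readyz` (readiness) and a hypothetical `AppState.ready` flag that the application sets once startup work is finished:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppState:
    """Hypothetical holder for application state."""
    ready = False  # set to True once startup work (e.g. cache warm-up) is done


def health_response(path: str, ready: bool) -> tuple[int, str]:
    """Map a request path to (HTTP status, body) for the two probe types."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer at all.
        return 200, "alive"
    if path == "/readyz":
        # Readiness: the service can take traffic right now.
        return (200, "ready") if ready else (503, "not ready")
    return 404, "not found"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_response(self.path, AppState.ready)
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Blocking call; port 8080 is an assumption for illustration."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Keeping the two endpoints separate lets a slow-starting or temporarily overloaded service report "alive but not ready", which avoids a restart while still taking it out of rotation.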
Non-Functional Requirements
NFR1: Health checks must run with minimal performance impact on containers.
NFR2: Health check latency should be under 1 second.
NFR3: System must support at least 10,000 containers concurrently.
NFR4: Availability target is 99.9% uptime for health check monitoring.
NFR5: Health check configuration must be flexible per container type.
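To get a feel for what NFR3 means in practice, the fleet-wide probe rate can be estimated with simple arithmetic. The sketch below assumes two probes per container and a 10-second period; both numbers are illustrative assumptions, not part of the requirements.

```python
def probes_per_second(containers: int, probes_per_container: int,
                      period_seconds: float) -> float:
    """Average number of probe executions per second across the fleet."""
    return containers * probes_per_container / period_seconds


# 10,000 containers (NFR3), one liveness and one readiness probe each,
# probed every 10 seconds (assumed interval).
print(probes_per_second(10_000, 2, 10))  # 2000.0
```

Halving the period doubles this rate, which is the trade-off between detection speed and overhead that NFR1 constrains.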
Think Before You Design
Key Components
Health check probes inside containers
Container runtime or orchestrator integration
Health check controller or manager
Monitoring and alerting system
Configuration management for health checks
Design Patterns
Circuit breaker pattern for unhealthy containers
Retry and backoff strategies for transient failures
Sidecar pattern for health monitoring
Push vs pull health check models
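As one illustration of the retry and backoff pattern listed above, the following sketch computes capped exponential delays between re-checks of a failing container. The function name `backoff_delays` and all parameter values are hypothetical; real orchestrators have their own backoff rules.

```python
import random


def backoff_delays(base, factor, cap, attempts, jitter=0.0, rng=None):
    """Return `attempts` delays: base * factor**i, capped at `cap`.

    `jitter` is a fraction (0.1 means +/-10%) applied to each delay so that
    many containers do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * factor ** attempt)
        if jitter:
            delay *= 1 + rng.uniform(-jitter, jitter)
        delays.append(delay)
    return delays


print(backoff_delays(1.0, 2.0, 10.0, 5))  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

The cap keeps a persistently failing container from being checked too rarely, while the jitter spreads load when many containers fail at the same moment.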
Reference Architecture
  +-------------------+       +----------------------+       +-----------------------+
  |                   |       |                      |       |                       |
  |   Container A     |<----->| Health Check Manager |<----->| Monitoring & Alerting |
  | (with probes)     |       | (Controller Service) |       | System                |
  |                   |       |                      |       |                       |
  +-------------------+       +----------------------+       +-----------------------+
           ^                             ^                               ^
           |                             |                               |
  +-------------------+       +----------------------+       +-----------------------+
  |                   |       |                      |       |                       |
  |   Container B     |<----->| Container Runtime /  |       | Configuration Store   |
  | (with probes)     |       | Orchestrator         |       | (Health check specs)  |
  |                   |       |                      |       |                       |
  +-------------------+       +----------------------+       +-----------------------+
Components
Container with Health Probes (Docker/Kubernetes): Exposes the endpoints or commands that liveness and readiness probes call to report health status.
Health Check Manager (custom microservice or Kubernetes controller): Manages health check scheduling, collects results, and triggers actions on failures.
Container Runtime / Orchestrator (Kubernetes, Docker Swarm, or similar): Executes health checks and restarts or isolates unhealthy containers.
Monitoring & Alerting System (Prometheus, Grafana, Alertmanager): Aggregates health check data, visualizes status, and sends alerts on failures.
Configuration Store (ConfigMaps, etcd, or similar): Stores health check configurations per container or service.
Request Flow
1. Liveness and readiness probes run against each container at configured intervals.
2. Probe results are reported to the Container Runtime or directly to the Health Check Manager.
3. The Health Check Manager aggregates results and evaluates container health.
4. If a container fails its liveness probe, the orchestrator restarts the container.
5. If a container fails its readiness probe, traffic routing to it is stopped.
6. The Health Check Manager sends health status metrics to the Monitoring & Alerting System.
7. The monitoring system visualizes health and triggers alerts if thresholds are breached.
8. The Configuration Store provides health check parameters to containers and the orchestrator.
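Steps 3 to 5 of the flow can be sketched as a small decision function. This is an illustrative model only: the names `ProbeTracker` and `decide`, the action strings, and the threshold of three consecutive failures are assumptions, not part of any orchestrator's API.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Counts consecutive failures for one probe on one container."""
    failure_threshold: int = 3
    consecutive_failures: int = 0

    def record(self, passed: bool) -> bool:
        """Record one result; return True once the threshold is reached."""
        if passed:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def decide(liveness: ProbeTracker, readiness: ProbeTracker,
           live_ok: bool, ready_ok: bool) -> str:
    """Return the action for this round: 'restart', 'stop_routing', or 'none'."""
    # Record both results first so neither counter is skipped.
    live_tripped = liveness.record(live_ok)
    ready_tripped = readiness.record(ready_ok)
    if live_tripped:
        return "restart"       # step 4: liveness failure
    if ready_tripped:
        return "stop_routing"  # step 5: readiness failure
    return "none"
```

Requiring several consecutive failures before acting is what keeps a single slow response from triggering a restart; a caller would reset both trackers after a restart.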
Database Schema
Entities:
- Container: id (PK), name, image, status
- HealthCheckConfig: id (PK), container_id (FK), type (liveness/readiness), interval_seconds, timeout_seconds, protocol (HTTP/TCP/Command), endpoint
- HealthCheckResult: id (PK), container_id (FK), timestamp, status (pass/fail), response_time_ms
Relationships:
- One Container has many HealthCheckConfigs
- One Container has many HealthCheckResults
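One possible rendering of these entities, using Python's built-in `sqlite3` module purely for illustration. The table names, column types, CHECK constraints, and the index on results are assumptions added for the sketch; a production system would more likely keep results in a time-series store.

```python
import sqlite3

SCHEMA = """
CREATE TABLE container (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    image TEXT NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE health_check_config (
    id INTEGER PRIMARY KEY,
    container_id INTEGER NOT NULL REFERENCES container(id),
    type TEXT NOT NULL CHECK (type IN ('liveness', 'readiness')),
    interval_seconds INTEGER NOT NULL,
    timeout_seconds INTEGER NOT NULL,
    protocol TEXT NOT NULL CHECK (protocol IN ('HTTP', 'TCP', 'Command')),
    endpoint TEXT
);
CREATE TABLE health_check_result (
    id INTEGER PRIMARY KEY,
    container_id INTEGER NOT NULL REFERENCES container(id),
    timestamp TEXT NOT NULL,
    status TEXT NOT NULL CHECK (status IN ('pass', 'fail')),
    response_time_ms INTEGER
);
-- Assumed index: most queries fetch recent results for one container.
CREATE INDEX idx_result_container_time
    ON health_check_result (container_id, timestamp);
"""


def create_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the schema in a new SQLite database and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn
```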
Scaling Discussion
Bottlenecks
Health Check Manager overwhelmed by large number of containers sending frequent health data.
Monitoring system storage and query performance degrade with high volume of health metrics.
Orchestrator delays in restarting or isolating unhealthy containers under heavy load.
Network overhead from frequent health check probes affecting container performance.
Solutions
Shard Health Check Manager by container groups or namespaces to distribute load.
Use time-series databases optimized for metrics (e.g., Prometheus) with retention policies.
Implement rate limiting and backoff for health checks to reduce network overhead.
Use asynchronous event-driven communication between components to improve responsiveness.
Scale orchestrator control plane horizontally and optimize restart policies.
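The sharding idea in the first solution can be sketched as a stable hash from namespace to shard index. The function name `shard_for` and the choice of SHA-256 are assumptions for illustration; any stable hash would do.

```python
import hashlib


def shard_for(namespace: str, shard_count: int) -> int:
    """Map a namespace to a Health Check Manager shard index, stably."""
    # hashlib is used instead of hash() because Python randomizes str hashes
    # per process, which would break agreement between manager instances.
    digest = hashlib.sha256(namespace.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Simple modulo hashing reassigns most namespaces when `shard_count` changes; if shards are added or removed often, consistent hashing is the usual alternative.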
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying health check types. Use 20 minutes to design components and data flow. Reserve 10 minutes to discuss scaling and trade-offs. Use the last 5 minutes for questions and a summary.
Explain difference between liveness and readiness probes and why both are needed.
Describe how health checks integrate with container orchestration for automated recovery.
Discuss trade-offs in probe frequency and impact on performance.
Highlight monitoring and alerting importance for operational visibility.
Address scaling challenges and practical solutions for large container fleets.

Practice

(1/5)
1. What is the main purpose of health checks in containers?
easy
A. To log all container network traffic
B. To increase the container's memory allocation
C. To update the container's software automatically
D. To verify if the container is running and responsive

Solution

  1. Step 1: Understand container health checks

    Health checks are used to confirm if a container is alive and working properly.
  2. Step 2: Identify the main goal

    The main goal is to detect if the container is responsive and healthy, so it can be restarted if needed.
  3. Final Answer:

    To verify if the container is running and responsive -> Option D
  4. Quick Check:

    Health checks = verify container health [OK]
Hint: Health checks confirm container responsiveness [OK]
Common Mistakes:
  • Confusing health checks with resource allocation
  • Thinking health checks update software
  • Assuming health checks log network data
2. Which of the following is the correct syntax to define a simple HTTP health check in a Docker container?
easy
A. HEALTHCHECK EXECUTE curl -f http://localhost/
B. HEALTHCHECK RUN curl http://localhost/
C. HEALTHCHECK CMD curl -f http://localhost/ || exit 1
D. HEALTHCHECK CHECK curl http://localhost/

Solution

  1. Step 1: Recall Docker health check syntax

    The correct Dockerfile syntax uses HEALTHCHECK CMD followed by a command that returns 0 on success.
  2. Step 2: Identify the correct command

    HEALTHCHECK CMD curl -f http://localhost/ || exit 1 uses 'curl -f' which fails on HTTP errors and 'exit 1' on failure, matching best practice.
  3. Final Answer:

    HEALTHCHECK CMD curl -f http://localhost/ || exit 1 -> Option C
  4. Quick Check:

    Docker healthcheck syntax = HEALTHCHECK CMD [OK]
Hint: Docker healthchecks use 'HEALTHCHECK CMD' syntax [OK]
Common Mistakes:
  • Using RUN instead of CMD in HEALTHCHECK
  • Using EXECUTE or CHECK which are invalid keywords
  • Not handling failure with exit codes
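Slim images often ship without curl. As an alternative sketch, the same check can be written as a small Python function that returns the exit code Docker expects (0 for healthy, 1 for unhealthy). The function name `probe` and the default timeout are assumptions for illustration.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> int:
    """Return 0 if the URL answers with a 2xx/3xx status, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 400 else 1
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP 4xx/5xx (HTTPError), refused connections, timeouts,
        # and malformed URLs.
        return 1
```

A script wrapping this would end with `sys.exit(probe("http://localhost/"))` and be referenced from the Dockerfile as, for example, `HEALTHCHECK CMD python /app/healthcheck.py` (the path is hypothetical).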
3. Consider this Kubernetes liveness probe configuration snippet:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
What happens if the container's /health endpoint returns HTTP 500 continuously?
medium
A. Kubernetes restarts the container after failing the liveness probe
B. Kubernetes ignores the failure and keeps the container running
C. Kubernetes scales up the number of containers
D. Kubernetes shuts down the entire pod immediately

Solution

  1. Step 1: Understand liveness probe behavior

    Liveness probes check if a container is alive; failure triggers a restart of that container.
  2. Step 2: Analyze the HTTP 500 response effect

    HTTP 500 falls outside the 200-399 range that Kubernetes treats as success, so each probe fails. Once consecutive failures reach failureThreshold (3 by default), Kubernetes restarts the container.
  3. Final Answer:

    Kubernetes restarts the container after failing the liveness probe -> Option A
  4. Quick Check:

    Liveness probe failure = container restart [OK]
Hint: Liveness failure triggers container restart [OK]
Common Mistakes:
  • Thinking Kubernetes ignores liveness failures
  • Confusing liveness probe with scaling behavior
  • Assuming pod shutdown instead of container restart
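For the snippet in this question, a rough estimate of when the restart is triggered can be worked out from the probe settings. The sketch assumes the default failureThreshold of 3, that the first probe fires at about initialDelaySeconds, and it ignores probe timeouts and scheduling jitter, so treat the result as an approximation.

```python
def earliest_restart_seconds(initial_delay: int, period: int,
                             failure_threshold: int = 3) -> int:
    """Approximate seconds after container start until the failing probe trips."""
    # First failure at initial_delay, then one more failure per period.
    return initial_delay + (failure_threshold - 1) * period


# initialDelaySeconds: 5, periodSeconds: 10, failureThreshold: 3 (default)
print(earliest_restart_seconds(5, 10))  # 25
```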
4. You have this Dockerfile snippet:
HEALTHCHECK CMD curl -f http://localhost:5000/health || exit 1
But the container never restarts even when the service is down. What is the likely issue?
medium
A. Standalone Docker only marks the container as unhealthy; nothing is acting on that status to restart it
B. The container does not expose port 5000
C. The health check command is missing the --interval option
D. The HEALTHCHECK CMD syntax is incorrect

Solution

  1. Step 1: Check health check command correctness

    The command syntax is correct and uses curl -f with exit 1 on failure.
  2. Step 2: Consider what acts on a failed health check

    A failing HEALTHCHECK changes the container's status to unhealthy but does not stop it. Docker restart policies such as on-failure react to the container exiting, not to its health status, so an orchestrator (for example Docker Swarm services or Kubernetes) or external automation has to restart unhealthy containers.
  3. Final Answer:

    Standalone Docker only marks the container as unhealthy; nothing is acting on that status to restart it -> Option A
  4. Quick Check:

    Unhealthy status alone does not restart a container [OK]
Hint: Check what reacts to the unhealthy status if the container doesn't restart [OK]
Common Mistakes:
  • Assuming health check command syntax is wrong
  • Assuming a restart policy reacts to health status
  • Thinking missing --interval causes no restart
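When debugging a case like this, it helps to confirm what Docker actually recorded for the health check. `docker inspect <container>` prints a JSON array, and for containers with a HEALTHCHECK the status is found under `State.Health.Status`. The helper below is a hypothetical sketch that parses that output; returning "none" when the `Health` section is missing is an assumption for containers without a health check.

```python
import json


def health_status(inspect_output: str) -> str:
    """Extract State.Health.Status from `docker inspect` JSON output."""
    containers = json.loads(inspect_output)
    health = containers[0].get("State", {}).get("Health")
    return health["Status"] if health else "none"


# Trimmed, hand-written sample of the relevant part of the output.
sample = '[{"State": {"Status": "running", "Health": {"Status": "unhealthy"}}}]'
print(health_status(sample))  # unhealthy
```

A container can be "running" and "unhealthy" at the same time, as in the sample, which is why the process state alone is not enough to diagnose this scenario.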
5. You want to design a microservice container that uses both readiness and liveness probes. Which of the following best describes their combined use?
hard
A. Both probes only log health status without affecting container state
B. Liveness probe restarts unhealthy containers; readiness probe controls traffic routing to only ready containers
C. Both probes restart containers on failure
D. Readiness probe restarts containers; liveness probe controls traffic routing

Solution

  1. Step 1: Understand liveness probe role

    Liveness probes detect if a container is alive; failure triggers container restart.
  2. Step 2: Understand readiness probe role

    Readiness probes check if a container is ready to serve traffic; failure removes it from load balancer routing.
  3. Step 3: Combine their functions

    Liveness restarts unhealthy containers; readiness limits traffic to containers that are ready to serve it.
  4. Final Answer:

    Liveness probe restarts unhealthy containers; readiness probe controls traffic routing to only ready containers -> Option B
  5. Quick Check:

    Liveness = restart, Readiness = traffic control [OK]
Hint: Liveness restarts; readiness controls traffic [OK]
Common Mistakes:
  • Mixing up readiness and liveness roles
  • Thinking readiness probe restarts containers
  • Assuming probes only log status without action