
Health checks in containers in Microservices - System Design Exercise

Design: Container Health Check System
The design focuses on health check mechanisms in container orchestration environments. Orchestrator internals and detailed alerting system design are out of scope.
Functional Requirements
FR1: Containers must report their health status regularly.
FR2: Health checks should detect if a container is alive and ready to serve traffic.
FR3: The system should support both liveness and readiness probes.
FR4: Health check failures should trigger container restarts or traffic rerouting.
FR5: Health check results must be accessible for monitoring and alerting.
Non-Functional Requirements
NFR1: Health checks must run with minimal performance impact on containers.
NFR2: Each health check should return a result in under 1 second.
NFR3: System must support at least 10,000 containers concurrently.
NFR4: Availability target is 99.9% uptime for health check monitoring.
NFR5: Health check configuration must be flexible per container type.
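As a rough sizing aid for NFR3, the sketch below estimates the aggregate probe rate across the fleet. The 10-second interval and the two probes per container are assumptions chosen for illustration; the requirements above do not fix either value.

```python
def probes_per_second(containers: int, probes_per_container: int,
                      interval_seconds: float) -> float:
    """Average number of probe executions per second across the fleet."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return containers * probes_per_container / interval_seconds


# 10,000 containers (NFR3), one liveness and one readiness probe each,
# with an assumed 10-second probe interval.
rate = probes_per_second(10_000, 2, 10)
print(rate)  # 2000.0
```

Halving the interval doubles this rate, which is the trade-off between detection speed and overhead that NFR1 constrains.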
Think Before You Design
Key Components
Health check probes inside containers
Container runtime or orchestrator integration
Health check controller or manager
Monitoring and alerting system
Configuration management for health checks
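One way to picture per-container-type configuration (NFR5) is a small typed record with defaults per type. This is a minimal sketch: the `ProbeConfig` class, the `failure_threshold` field, the default values, the container type names `web` and `worker`, and the endpoint strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeConfig:
    probe_type: str              # "liveness" or "readiness"
    protocol: str                # "HTTP", "TCP", or "Command"
    endpoint: str                # URL path, port, or command, depending on protocol
    interval_seconds: int = 10   # assumed default
    timeout_seconds: int = 1     # assumed default, in line with NFR2
    failure_threshold: int = 3   # assumed: consecutive failures before acting

    def __post_init__(self):
        if self.probe_type not in ("liveness", "readiness"):
            raise ValueError(f"unknown probe type: {self.probe_type}")
        if self.protocol not in ("HTTP", "TCP", "Command"):
            raise ValueError(f"unknown protocol: {self.protocol}")
        if self.timeout_seconds > self.interval_seconds:
            raise ValueError("timeout must not exceed interval")


# Hypothetical defaults keyed by container type.
DEFAULTS_BY_TYPE = {
    "web": [
        ProbeConfig("liveness", "HTTP", "/healthz"),
        ProbeConfig("readiness", "HTTP", "/readyz"),
    ],
    "worker": [
        ProbeConfig("liveness", "Command", "check-worker", interval_seconds=30),
    ],
}


def probes_for(container_type: str) -> list:
    """Return the probe configs for a container type, or an empty list."""
    return DEFAULTS_BY_TYPE.get(container_type, [])
```

Validating at construction time means a bad configuration is rejected when it is loaded rather than when the first probe runs.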
Design Patterns
Circuit breaker pattern for unhealthy containers
Retry and backoff strategies for transient failures
Sidecar pattern for health monitoring
Push vs pull health check models
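The retry-and-backoff pattern listed above can be sketched as a capped exponential delay schedule. The function name and parameter values are illustrative assumptions; production implementations commonly add random jitter, which is left out here to keep the output deterministic.

```python
def backoff_delays(base_seconds: float, factor: float,
                   max_seconds: float, attempts: int) -> list:
    """Delay before each retry: base * factor**i, capped at max_seconds."""
    return [min(base_seconds * factor ** i, max_seconds) for i in range(attempts)]


print(backoff_delays(1.0, 2.0, 10.0, 5))  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

The cap keeps a long-failing container from being rechecked too rarely, while the growth factor reduces load during a transient outage.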
Reference Architecture
  +-------------------+       +----------------------+       +-----------------------+
  |                   |       |                      |       |                       |
  |   Container A     |<----->| Health Check Manager |<----->| Monitoring & Alerting |
  | (with probes)     |       | (Controller Service) |       | System                |
  |                   |       |                      |       |                       |
  +-------------------+       +----------------------+       +-----------------------+
           ^                             ^                               ^
           |                             |                               |
  +-------------------+       +----------------------+       +-----------------------+
  |                   |       |                      |       |                       |
  |   Container B     |<----->| Container Runtime /  |       | Configuration Store   |
  | (with probes)     |       | Orchestrator         |       | (Health check specs)  |
  |                   |       |                      |       |                       |
  +-------------------+       +----------------------+       +-----------------------+
Components
Container with Health Probes
Docker/Kubernetes
Runs liveness and readiness probes inside containers to report health status.
Health Check Manager
Custom microservice or Kubernetes controller
Manages health check scheduling, collects results, and triggers actions on failures.
Container Runtime / Orchestrator
Kubernetes, Docker Swarm, or similar
Executes health checks and restarts or isolates unhealthy containers.
Monitoring & Alerting System
Prometheus, Grafana, Alertmanager
Aggregates health check data, visualizes status, and sends alerts on failures.
Configuration Store
ConfigMaps, etcd, or similar
Stores health check configurations per container or service.
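To make the "Container with Health Probes" component concrete, here is a minimal sketch of a service exposing separate liveness and readiness endpoints using only the Python standard library. The paths `/healthz` and `/readyz`, the port 8080, and the `ready` flag are assumptions for illustration, not part of the design above.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag: the application sets it once its dependencies
# (database connections, caches, and so on) are initialized.
ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: the process is up and able to answer requests.
            self._reply(200, b"alive")
        elif self.path == "/readyz":
            # Readiness: report ready only after startup work has finished.
            if ready.is_set():
                self._reply(200, b"ready")
            else:
                self._reply(503, b"not ready")
        else:
            self._reply(404, b"not found")

    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep probe traffic out of the logs in this sketch.


def serve(port: int = 8080) -> None:
    """Block forever, answering probe requests on the given port."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Keeping the two endpoints separate lets a container that is alive but still warming up stay out of the traffic rotation without being restarted.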
Request Flow
1. Container runs liveness and readiness probes at configured intervals.
2. Probe results are reported to the Container Runtime or directly to the Health Check Manager.
3. Health Check Manager aggregates results and evaluates container health.
4. If a container fails liveness probe, the orchestrator restarts the container.
5. If a container fails readiness probe, traffic routing to it is stopped.
6. Health Check Manager sends health status metrics to Monitoring & Alerting System.
7. Monitoring system visualizes health and triggers alerts if thresholds are breached.
8. Configuration Store provides health check parameters to containers and orchestrator.
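The evaluation in steps 3 to 5 can be sketched as a small state tracker that maps probe results to an action. The names `Action` and `ContainerHealth` and the consecutive-failure threshold are assumptions; the flow above does not say how many failures should trigger an action.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    STOP_TRAFFIC = "stop_traffic"
    RESTART = "restart"


@dataclass
class ContainerHealth:
    failure_threshold: int = 3    # assumed: consecutive failures before acting
    liveness_failures: int = 0
    readiness_failures: int = 0

    def record(self, liveness_ok: bool, readiness_ok: bool) -> Action:
        """Record one round of probe results and return the action to take."""
        # A passing probe resets its counter; a failing one increments it.
        self.liveness_failures = 0 if liveness_ok else self.liveness_failures + 1
        self.readiness_failures = 0 if readiness_ok else self.readiness_failures + 1

        if self.liveness_failures >= self.failure_threshold:
            # A restart gives the container a clean slate.
            self.liveness_failures = 0
            self.readiness_failures = 0
            return Action.RESTART
        if self.readiness_failures >= self.failure_threshold:
            return Action.STOP_TRAFFIC
        return Action.NONE


health = ContainerHealth(failure_threshold=2)
first = health.record(liveness_ok=True, readiness_ok=False)   # below threshold
second = health.record(liveness_ok=True, readiness_ok=False)  # threshold reached
```

Requiring several consecutive failures avoids restarting a container over a single slow response. Liveness is checked first because a restart supersedes traffic rerouting.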
Database Schema
Entities:
- Container: id (PK), name, image, status
- HealthCheckConfig: id (PK), container_id (FK), type (liveness/readiness), interval_seconds, timeout_seconds, protocol (HTTP/TCP/Command), endpoint
- HealthCheckResult: id (PK), container_id (FK), timestamp, status (pass/fail), response_time_ms
Relationships:
- One Container has many HealthCheckConfigs
- One Container has many HealthCheckResults
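The entities above map onto relational tables fairly directly. The sketch below creates them in an in-memory SQLite database; the choice of SQLite, the snake_case table names, the column types, the CHECK constraints, and the index are assumptions layered on top of the listed fields.

```python
import sqlite3

SCHEMA = """
CREATE TABLE container (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    image  TEXT NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE health_check_config (
    id               INTEGER PRIMARY KEY,
    container_id     INTEGER NOT NULL REFERENCES container(id),
    type             TEXT NOT NULL CHECK (type IN ('liveness', 'readiness')),
    interval_seconds INTEGER NOT NULL,
    timeout_seconds  INTEGER NOT NULL,
    protocol         TEXT NOT NULL CHECK (protocol IN ('HTTP', 'TCP', 'Command')),
    endpoint         TEXT
);
CREATE TABLE health_check_result (
    id               INTEGER PRIMARY KEY,
    container_id     INTEGER NOT NULL REFERENCES container(id),
    timestamp        TEXT NOT NULL,
    status           TEXT NOT NULL CHECK (status IN ('pass', 'fail')),
    response_time_ms INTEGER
);
-- Assumed index: results are usually read per container, newest first.
CREATE INDEX idx_result_container_time
    ON health_check_result (container_id, timestamp);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three tables and the lookup index."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

At the scale in NFR3, raw results would more likely live in a time-series store, with a relational table like this holding only configuration and recent state.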
Scaling Discussion
Bottlenecks
Health Check Manager overwhelmed by large number of containers sending frequent health data.
Monitoring system storage and query performance degrade with high volume of health metrics.
Orchestrator delays in restarting or isolating unhealthy containers under heavy load.
Network overhead from frequent health check probes affecting container performance.
Solutions
Shard Health Check Manager by container groups or namespaces to distribute load.
Use time-series databases optimized for metrics (e.g., Prometheus) with retention policies.
Implement rate limiting and backoff for health checks to reduce network overhead.
Use asynchronous event-driven communication between components to improve responsiveness.
Scale orchestrator control plane horizontally and optimize restart policies.
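Sharding the Health Check Manager by namespace, as suggested above, needs a mapping from namespace to shard that is stable across processes. A minimal sketch, assuming a fixed shard count and a hypothetical `shard_for` helper:

```python
import hashlib


def shard_for(namespace: str, shard_count: int) -> int:
    """Map a namespace to a shard index in the range [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # Use a cryptographic digest rather than the built-in hash(), which is
    # randomized per process for strings and so would not be stable.
    digest = hashlib.sha256(namespace.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Plain modulo sharding reassigns most namespaces whenever the shard count changes; consistent hashing is a common alternative when shards are added or removed often.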
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying health check types, 20 minutes designing components and data flow, 10 minutes discussing scaling and trade-offs, and the last 5 minutes on questions and a summary.
Explain the difference between liveness and readiness probes and why both are needed.
Describe how health checks integrate with container orchestration for automated recovery.
Discuss trade-offs in probe frequency and impact on performance.
Highlight the importance of monitoring and alerting for operational visibility.
Address scaling challenges and practical solutions for large container fleets.