Liveness and readiness probes in Microservices - System Design Exercise

Design: Microservices Health Check System with Liveness and Readiness Probes
The design covers the health check mechanism that uses liveness and readiness probes for microservices in container orchestration environments. It excludes detailed orchestration logic and deployment pipelines.
Functional Requirements
FR1: Detect if a microservice instance is alive and responsive (liveness probe).
FR2: Detect if a microservice instance is ready to serve traffic (readiness probe).
FR3: Automatically restart or remove unhealthy instances based on probe results.
FR4: Support configurable probe endpoints and intervals.
FR5: Integrate with container orchestration platforms like Kubernetes.
FR6: Minimize false positives to avoid unnecessary restarts or unnecessary removal of instances from traffic routing.
Non-Functional Requirements
NFR1: System must handle at least 1000 microservice instances concurrently.
NFR2: Probe response time should be under 100ms to avoid delays in orchestration decisions.
NFR3: Availability target: 99.9% uptime for the health check system itself.
NFR4: Probes must not add significant load to microservices.
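As a rough back-of-envelope check of NFR1 and NFR4, the sketch below estimates probe traffic. It assumes each instance has one liveness and one readiness probe and that each fires every 10 seconds; the period is an illustrative assumption, not a figure from the requirements.

```python
def probe_rate(instances: int, probes_per_instance: int, period_seconds: float) -> tuple[float, float]:
    """Return (fleet-wide probes per second, probes per second per instance)."""
    per_instance = probes_per_instance / period_seconds
    fleet = instances * probes_per_instance / period_seconds
    return fleet, per_instance


fleet_rps, instance_rps = probe_rate(instances=1000, probes_per_instance=2, period_seconds=10)
print(fleet_rps, instance_rps)  # 200.0 0.2
```

Under these assumptions each instance answers one probe every five seconds, which is negligible if the handlers are cheap. In Kubernetes the kubelet on each node probes only its own pods, so the fleet-wide total is spread across nodes rather than concentrated in one component.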
Key Components
Probe endpoints inside microservices (HTTP /healthz, /ready)
Container orchestration health check integration (e.g., Kubernetes probes)
Health check controller or manager
Logging and alerting for probe failures
Configuration management for probe parameters
Design Patterns
Health check pattern
Circuit breaker pattern for readiness
Retry and backoff strategies for transient failures
Sidecar pattern for external health monitoring
Graceful shutdown and startup hooks
Reference Architecture
                    +--------------------------+
                    |  Container Orchestrator  |
                    |  (e.g., Kubernetes)      |
                    +------------+-------------+
                                 |
                                 |  GET /healthz (liveness)
                                 |  GET /ready   (readiness)
                                 |
                +----------------+----------------+
                |                                 |
        +-------v--------+                +-------v--------+
        | Microservice 1 |                | Microservice 2 |
        | +-----------+  |                | +-----------+  |
        | | /healthz  |  |                | | /healthz  |  |
        | | /ready    |  |                | | /ready    |  |
        | +-----------+  |                | +-----------+  |
        +----------------+                +----------------+

Legend:
- Orchestrator calls /healthz to check if service is alive.
- Orchestrator calls /ready to check if service is ready to receive traffic.
- Based on probe results, orchestrator restarts or routes traffic accordingly.
Components
Microservice Probe Endpoints (HTTP REST endpoints): Expose /healthz for liveness and /ready for readiness checks.
Container Orchestrator Health Checks (Kubernetes liveness and readiness probes): Periodically call probe endpoints to monitor service health and readiness.
Health Check Controller (orchestrator internal component): Manage probe results, restart unhealthy pods, and update service routing.
Configuration Management (config files or environment variables): Set probe intervals, timeouts, and failure thresholds.
Logging and Alerting (centralized logging system, e.g., ELK stack): Record probe failures and notify operators.
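To make the probe endpoints concrete, here is a minimal sketch using only the Python standard library. The `ServiceState` class, the `start_probe_server` helper, and the choice of 503 for "not ready" are assumptions made for illustration; a real service would normally register these routes in whatever web framework it already uses.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ServiceState:
    """Hypothetical holder for the readiness flag, set once startup work finishes."""

    def __init__(self) -> None:
        self.ready = threading.Event()


def make_handler(state: ServiceState):
    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path == "/healthz":
                status = 200  # the process is up and able to answer HTTP
            elif self.path == "/ready":
                status = 200 if state.ready.is_set() else 503
            else:
                status = 404
            self.send_response(status)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, format, *args) -> None:
            pass  # keep routine probe traffic out of the access log

    return ProbeHandler


def start_probe_server(state: ServiceState, port: int = 0) -> ThreadingHTTPServer:
    """Serve probes on a background thread; port 0 lets the OS pick a free port."""
    # Bound to localhost for a local demo; inside a container this would
    # typically bind to all interfaces so the orchestrator can reach it.
    server = ThreadingHTTPServer(("127.0.0.1", port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The application would call `state.ready.set()` after its startup work completes, so /ready flips from 503 to 200 while /healthz answers 200 throughout.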
Request Flow
1. Container orchestrator sends HTTP GET request to /healthz endpoint of a microservice instance.
2. Microservice responds with 200 OK if alive; otherwise, returns error or no response.
3. Orchestrator marks instance as unhealthy if liveness probe fails repeatedly and restarts it.
4. Orchestrator sends HTTP GET request to /ready endpoint to check if instance is ready to serve traffic.
5. Microservice responds with 200 OK if ready; otherwise, returns error or no response.
6. Orchestrator routes traffic only to instances passing readiness probes.
7. Configuration parameters control probe frequency, timeout, and failure thresholds.
8. Logs and alerts are generated on probe failures for monitoring.
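The decision logic in the flow above can be sketched as a small state tracker. This is a simplified model, not the orchestrator's actual implementation: `ProbeTracker` and `decide` are hypothetical names, and the tracker assumes an instance starts out healthy, which real orchestrators may not do for readiness.

```python
from dataclasses import dataclass, field


@dataclass
class ProbeTracker:
    """Tracks consecutive results for one probe on one instance."""

    failure_threshold: int = 3
    success_threshold: int = 1
    healthy: bool = True  # simplifying assumption about the initial state
    _failures: int = field(default=0, init=False, repr=False)
    _successes: int = field(default=0, init=False, repr=False)

    def record(self, ok: bool) -> bool:
        """Record one probe result and return the current healthy flag."""
        if ok:
            self._successes += 1
            self._failures = 0
            if self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


def decide(liveness: ProbeTracker, readiness: ProbeTracker) -> str:
    """Map probe state to the orchestrator action described in the flow."""
    if not liveness.healthy:
        return "restart"
    if not readiness.healthy:
        return "remove_from_routing"
    return "route_traffic"
```

Because only consecutive failures count, a single dropped probe does not change the instance's state, which is the behavior step 3 describes with "fails repeatedly".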
Database Schema
Not applicable as probes are stateless HTTP endpoints within microservices; health state is managed by orchestrator in-memory or via its internal state store.
Scaling Discussion
Bottlenecks
High number of probe requests causing load on microservices.
Delayed detection due to long probe intervals.
False positives from transient network issues.
Orchestrator overwhelmed by managing many probe results.
Probe endpoints causing resource contention inside microservices.
Solutions
Use lightweight probe endpoints that perform minimal checks to reduce load.
Tune probe intervals and failure thresholds to balance detection speed and stability.
Implement retries and backoff in the orchestrator before marking an instance as failed.
Distribute health check load across orchestrator components or use sidecar proxies.
Isolate probe handling in microservices with dedicated threads or lightweight handlers.
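One way to keep probe endpoints lightweight is to cache the result of any expensive dependency check for a short time, so frequent probes do not repeat the work. The sketch below is one possible approach; `CachedCheck` and its `ttl_seconds` parameter are hypothetical names, and the class is not thread-safe as written.

```python
import time
from typing import Callable


class CachedCheck:
    """Caches a boolean health check so frequent probes do not repeat expensive work."""

    def __init__(
        self,
        check: Callable[[], bool],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = float("-inf")  # forces a real check on first use
        self._value = False

    def __call__(self) -> bool:
        now = self._clock()
        if now >= self._expires_at:
            self._value = self._check()
            self._expires_at = now + self._ttl
        return self._value
```

A readiness handler could wrap a database ping in `CachedCheck(ping, ttl_seconds=5)` and call it on every probe. The trade-off is that a failure may go unnoticed for up to one TTL, so the TTL should stay below the probe period multiplied by the failure threshold.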
Interview Tips
Time: Spend 10 minutes understanding probe concepts and requirements, 15 minutes designing the probe endpoints and orchestration integration, 10 minutes discussing scaling and failure handling, and 10 minutes for Q&A.
Explain difference between liveness and readiness probes clearly.
Discuss how probes help maintain system reliability and availability.
Describe how probe failures trigger orchestrator actions like restarts or traffic routing.
Mention configuration flexibility and tuning for different workloads.
Highlight strategies to avoid false positives and minimize probe overhead.
Discuss scaling challenges and solutions for large microservice deployments.

Practice

1. What is the main purpose of a liveness probe in microservices?
easy
A. To check if the service is ready to accept traffic
B. To log user requests for debugging
C. To monitor the network latency between services
D. To check if the service is alive and restart it if it is not

Solution

  1. Step 1: Understand the role of liveness probes

    Liveness probes detect if a service is stuck or dead and needs restarting.
  2. Step 2: Differentiate from readiness probes

    Readiness probes check if the service can handle requests, not if it is alive.
  3. Final Answer:

    To check if the service is alive and restart it if it is not -> Option D
  4. Quick Check:

    Liveness probe = check alive and restart [OK]
Hint: Liveness = alive and restart, Readiness = ready for traffic [OK]
Common Mistakes:
  • Confusing liveness with readiness probes
  • Thinking liveness probes check traffic readiness
  • Assuming liveness probes monitor performance
2. Which of the following is the correct syntax to define a readiness probe in a Kubernetes pod spec?
easy
A. livenessProbe: exec: command: ["cat", "/tmp/healthy"] timeoutSeconds: 1
B. livenessProbe: tcpSocket: port: 8080 initialDelaySeconds: 5 periodSeconds: 10
C. readinessProbe: httpGet: path: /healthz port: 8080 initialDelaySeconds: 5 periodSeconds: 10
D. livenessProbe: httpGet: path: /ready port: 8080 failureThreshold: 3

Solution

  1. Step 1: Identify readiness probe syntax

    Readiness probes often use httpGet with path and port, plus delay and period settings.
  2. Step 2: Confirm correct fields and indentation

    readinessProbe: httpGet: path: /healthz port: 8080 initialDelaySeconds: 5 periodSeconds: 10 correctly shows readinessProbe with httpGet, initialDelaySeconds, and periodSeconds.
  3. Final Answer:

    readinessProbe: httpGet: path: /healthz port: 8080 initialDelaySeconds: 5 periodSeconds: 10 -> Option C
  4. Quick Check:

    Readiness probe syntax = readinessProbe: httpGet: path: /healthz port: 8080 initialDelaySeconds: 5 periodSeconds: 10 [OK]
Hint: Readiness uses httpGet with path and port in YAML [OK]
Common Mistakes:
  • Mixing livenessProbe and readinessProbe fields
  • Incorrect indentation in YAML
  • Using wrong probe type for readiness
3. Given this Kubernetes pod spec snippet, what will happen if the readiness probe fails continuously?
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
medium
A. The pod will be restarted immediately
B. The pod will be marked as not ready and removed from service endpoints
C. The pod will ignore the failure and continue serving traffic
D. The pod will scale up automatically

Solution

  1. Step 1: Understand readiness probe failure effect

    Readiness probe failure marks pod as not ready, so it stops receiving traffic.
  2. Step 2: Differentiate from liveness probe effect

    A liveness probe failure triggers a restart of the container; a readiness probe failure does not.
  3. Final Answer:

    The pod will be marked as not ready and removed from service endpoints -> Option B
  4. Quick Check:

    Readiness failure = pod not ready, no restart [OK]
Hint: Readiness failure removes pod from load balancer, no restart [OK]
Common Mistakes:
  • Confusing readiness failure with pod restart
  • Assuming pod scales automatically on probe failure
  • Ignoring failureThreshold effect
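For the snippet in question 3, the time between the service becoming unready and the pod being marked not ready can be estimated from periodSeconds and failureThreshold. The helper below is an approximation that ignores timeoutSeconds and scheduling jitter; the function name is hypothetical.

```python
def readiness_detection_window(period_seconds: int, failure_threshold: int) -> tuple[int, int]:
    """Approximate (best, worst) seconds from becoming unready to being marked not ready."""
    # Best case: the failure begins just before a probe fires.
    best = (failure_threshold - 1) * period_seconds
    # Worst case: the failure begins just after a probe succeeded.
    worst = failure_threshold * period_seconds
    return best, worst


print(readiness_detection_window(period_seconds=10, failure_threshold=3))  # (20, 30)
```

With the values from the question, the pod keeps receiving traffic for roughly 20 to 30 seconds after it stops being ready, which is why "Ignoring failureThreshold effect" appears among the common mistakes.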
4. A microservice has a liveness probe configured as an HTTP GET on /health. The service sometimes returns HTTP 500 during startup but is healthy afterward. What is the best fix to avoid unnecessary restarts?
medium
A. Increase initialDelaySeconds to allow startup time before probing
B. Change the probe to readiness probe instead of liveness probe
C. Remove the probe completely to avoid restarts
D. Set failureThreshold to 1 to detect failures faster

Solution

  1. Step 1: Identify cause of restarts

    The liveness probe fails during startup because the service returns HTTP 500 until it has finished starting.
  2. Step 2: Adjust probe timing to avoid false failures

    Increasing initialDelaySeconds delays probe start, allowing service to become healthy first.
  3. Final Answer:

    Increase initialDelaySeconds to allow startup time before probing -> Option A
  4. Quick Check:

    Delay liveness probe start to avoid false failures [OK]
Hint: Delay liveness probe start to avoid false failure during startup [OK]
Common Mistakes:
  • Removing probes which reduces reliability
  • Confusing readiness and liveness probe roles
  • Setting failureThreshold too low causing quick restarts
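To see how much startup time a given liveness configuration tolerates, a rough estimate is the time of the probe that would complete the failure threshold. The sketch below assumes the first probe fires at initialDelaySeconds and later probes fire every periodSeconds, ignoring timeouts and jitter; the function name and the sample values are illustrative.

```python
def tolerated_startup_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate startup time a liveness probe tolerates before a restart is triggered."""
    # The probe that would be the failure_threshold-th consecutive failure
    # fires at this time; the service must be healthy by then.
    return initial_delay + (failure_threshold - 1) * period


print(tolerated_startup_seconds(initial_delay=5, period=10, failure_threshold=3))   # 25
print(tolerated_startup_seconds(initial_delay=30, period=10, failure_threshold=3))  # 50
```

Raising initialDelaySeconds from 5 to 30 in this example doubles the tolerated startup time, whereas lowering failureThreshold to 1 would reduce it to the initial delay alone.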
5. You have a microservice that takes time to initialize resources before it can serve requests. You want to ensure it is not restarted unnecessarily but also not receive traffic before ready. How should you configure liveness and readiness probes?
hard
A. Set liveness probe with a longer initialDelaySeconds and readiness probe to check resource initialization
B. Use only a liveness probe with a short periodSeconds to restart fast
C. Use only a readiness probe and no liveness probe
D. Set both probes to the same HTTP path and timing

Solution

  1. Step 1: Prevent unnecessary restarts during initialization

    Set liveness probe initialDelaySeconds long enough to avoid restarting while initializing.
  2. Step 2: Use readiness probe to block traffic until ready

    Readiness probe should check if resources are initialized before accepting traffic.
  3. Final Answer:

    Set liveness probe with a longer initialDelaySeconds and readiness probe to check resource initialization -> Option A
  4. Quick Check:

    Liveness delay + readiness check = safe startup [OK]
Hint: Delay liveness, readiness blocks traffic until ready [OK]
Common Mistakes:
  • Using only one probe type causing traffic or restart issues
  • Setting same path and timing for both probes
  • Not delaying liveness probe causing premature restarts
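The configuration from option A can be sketched as a pair of probe definitions. The field names below follow the Kubernetes probe spec used in the earlier questions, expressed as a Python dictionary and printed as JSON; the /healthz and /ready paths come from the design above, while the port and all timing values are illustrative assumptions.

```python
import json

# Timing values are assumptions chosen for illustration, not recommendations.
container_probes = {
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "initialDelaySeconds": 30,  # assumed to exceed the service's startup time
        "periodSeconds": 10,
        "failureThreshold": 3,
    },
    "readinessProbe": {
        "httpGet": {"path": "/ready", "port": 8080},
        "initialDelaySeconds": 5,  # start checking early; traffic stays blocked until it passes
        "periodSeconds": 5,
        "failureThreshold": 3,
    },
}

print(json.dumps(container_probes, indent=2))
```

The two probes use different paths and different delays: the liveness probe waits long enough to avoid restarting a service that is still initializing, while the readiness probe starts early and keeps traffic away until the resources are in place.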