
Health check pattern in Microservices - System Design Guide

Problem Statement
When a microservice fails or becomes unresponsive, other services or load balancers may continue sending requests to it, causing errors and degraded user experience. Without a way to verify if a service is healthy, the system cannot automatically detect failures or reroute traffic, leading to downtime and cascading failures.
Solution
The health check pattern solves this by having each service expose a simple endpoint that reports its current status. Other components periodically call this endpoint to verify if the service is alive and functioning. If the health check fails, the service is marked unhealthy and traffic is redirected away until it recovers.
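As a rough illustration of the polling step described above, the sketch below runs one "sweep" over a set of instances and records which ones passed. The names `run_health_sweep`, `routable`, and the `probe` callable are hypothetical and chosen for this example; a real monitor would pass in a function that makes an HTTP call to each instance's health endpoint.

```python
from typing import Callable, Dict, Iterable, List


def run_health_sweep(
    instances: Iterable[str], probe: Callable[[str], bool]
) -> Dict[str, bool]:
    """Probe every instance once and return {instance: is_healthy}."""
    results: Dict[str, bool] = {}
    for instance in instances:
        try:
            results[instance] = bool(probe(instance))
        except Exception:
            # A probe that raises (timeout, refused connection, ...)
            # is treated the same as a failed health check.
            results[instance] = False
    return results


def routable(results: Dict[str, bool]) -> List[str]:
    """Return only the instances that passed the latest sweep."""
    return [instance for instance, ok in results.items() if ok]
```

Calling `run_health_sweep` on a timer and routing only to `routable(...)` gives the behaviour described: a failing instance drops out of rotation and returns once a later sweep succeeds.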
Architecture
[Diagram: Client, Load Balancer, Health Check Monitor]

This diagram shows a client sending requests through a load balancer to a microservice. A health check monitor periodically queries the microservice's health endpoint to determine its status and informs the load balancer to route traffic accordingly.
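The routing side of this architecture might look like the following minimal sketch: a round-robin balancer that skips any instance the monitor has reported as unhealthy. `HealthAwareBalancer`, `report`, and `pick` are illustrative names, not part of any particular load balancer's API, and the class is not thread-safe.

```python
class HealthAwareBalancer:
    """Round-robin balancer that only hands out healthy instances."""

    def __init__(self, instances):
        self._instances = list(instances)
        # Assumption: instances start out healthy until a probe says otherwise.
        self._healthy = set(self._instances)
        self._next = 0

    def report(self, instance, healthy):
        """Called by the health check monitor after each probe."""
        if instance not in self._instances:
            raise ValueError(f"unknown instance: {instance}")
        if healthy:
            self._healthy.add(instance)
        else:
            self._healthy.discard(instance)

    def pick(self):
        """Return the next healthy instance in round-robin order."""
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy instances available")
```

Keeping the health state separate from the instance list means an instance that recovers simply rejoins the rotation on the next `report(instance, True)`.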

Trade-offs
✓ Pros
Enables automatic detection of unhealthy services to improve system reliability.
Allows load balancers and orchestrators to reroute traffic away from failing instances.
Simple to implement with minimal overhead on services.
Supports graceful recovery and reduces downtime.
✗ Cons
Requires additional infrastructure or monitoring components to poll health endpoints.
Health checks may not detect all types of failures, such as degraded performance.
Improper health check design can cause false positives or negatives, affecting routing.
Use this pattern when running multiple microservice instances behind load balancers or orchestrators, especially at higher traffic levels (as a rough guide, above 100 requests per second) or when high availability is critical.
Avoid it for very simple or single-instance services, where there is nowhere to reroute traffic and failure detection adds complexity without benefit.
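One common way to reduce the false positives and negatives listed under Cons is to require several consecutive probe results before changing an instance's state. The sketch below assumes illustrative thresholds (3 failures to mark unhealthy, 2 successes to recover); the class name `InstanceHealth` and the default values are assumptions for this example, not recommendations.

```python
class InstanceHealth:
    """Tracks one instance's state using consecutive-result thresholds."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        # Number of consecutive probes that contradict the current state.
        self._streak = 0

    def record(self, probe_ok):
        """Record one probe result and return the (possibly updated) state."""
        if probe_ok == self.healthy:
            # Result agrees with the current state: reset the streak.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy
```

With these defaults, a single failed probe followed by a success leaves the instance in rotation, so a brief network blip does not trigger rerouting.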
Real World Examples
Netflix
Netflix uses health checks to monitor microservices in its streaming platform, enabling automatic failover and traffic rerouting to healthy instances to maintain uninterrupted playback.
Uber
Uber employs health checks in its microservice architecture to detect service failures quickly and prevent cascading outages during high-demand periods.
Amazon
Amazon uses health checks in its AWS Elastic Load Balancer to route traffic only to healthy EC2 instances, ensuring reliable service delivery.
Code Example
The before code lacks any endpoint to report service health, so external systems cannot verify if the service is alive. The after code adds a '/health' endpoint that returns a simple JSON status. Load balancers or monitors can call this endpoint to check if the service is healthy and route traffic accordingly.
### Before: No health check endpoint
from flask import Flask
app = Flask(__name__)

@app.route('/')
def home():
    return 'Hello World'


### After: Adding a health check endpoint
from flask import Flask, jsonify
app = Flask(__name__)

@app.route('/')
def home():
    return 'Hello World'

@app.route('/health')
def health_check():
    # Simple health check returning service status
    status = {'status': 'healthy'}
    return jsonify(status), 200
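
To show how a monitor might consume such an endpoint, here is a hedged sketch of a probe that treats only an HTTP 200 answer within a timeout as healthy. To keep the example self-contained, a small standard-library server stands in for the Flask app; `probe`, `_DemoHandler`, and `start_demo_server` are hypothetical names introduced for this illustration.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Bypass any proxy configured in the environment: health probes
# are assumed to target the instance directly.
_direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=2.0):
    """Return True only if the endpoint answers HTTP 200 within the timeout."""
    try:
        with _direct.open(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and 4xx/5xx answers
        # (urllib raises HTTPError, an OSError subclass, for those).
        return False


class _DemoHandler(BaseHTTPRequestHandler):
    """Stand-in for the Flask app: serves /health, 404 for anything else."""

    def do_GET(self):
        if self.path == "/health":
            body = b'{"status": "healthy"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def start_demo_server():
    """Start the stand-in server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), _DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Note that the probe looks at the status code rather than the response body, which matches how load balancers and orchestrators typically decide whether an instance is healthy.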
Alternatives
Circuit Breaker
A circuit breaker stops requests to a failing service after the caller observes repeated failures, whereas a health check proactively monitors service status from outside the request path.
Use when: you want to prevent cascading failures by cutting off calls after repeated errors, especially for transient faults.
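For contrast with the polling approach, a minimal circuit breaker could be sketched as follows. It is reactive: it counts failures of real calls and, once a threshold is reached, rejects further calls until a reset timeout has passed. The class names, the default threshold and timeout, and the injectable `clock` parameter are assumptions made for this example; the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The two patterns are complementary: health checks remove a bad instance from routing, while a circuit breaker protects each caller between health check intervals.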
Service Mesh
A service mesh provides built-in health checking and traffic management at the network layer, moving health check logic out of application code.
Use when: you need advanced traffic control and observability across many microservices.
Summary
The health check pattern prevents requests from being sent to failed or unresponsive microservices by verifying their status regularly.
It works by exposing a simple endpoint that external systems poll to determine service health and reroute traffic accordingly.
This pattern improves system reliability and availability, especially in large-scale microservice architectures.

Practice

1. What is the main purpose of the health check pattern in microservices?
easy
A. To regularly verify if a service is running and responsive
B. To increase the size of the service database
C. To encrypt communication between services
D. To deploy new versions of the service automatically

Solution

  1. Step 1: Understand the health check pattern purpose

    The health check pattern is designed to monitor if a microservice is alive and functioning properly by sending simple requests to it.
  2. Step 2: Identify the correct purpose among options

    Only option A, "To regularly verify if a service is running and responsive", describes this monitoring function; the other options describe unrelated tasks such as growing the database, encryption, or deployment.
  3. Final Answer:

    To regularly verify if a service is running and responsive -> Option A
  4. Quick Check:

    Health check = verify service status [OK]
Hint: Health check means checking if service is alive and working [OK]
Common Mistakes:
  • Confusing health check with deployment or encryption
  • Thinking health check changes service data
  • Assuming health check increases service capacity
2. Which of the following is the correct way to implement a health check endpoint in a microservice?
easy
A. Create an endpoint like /health that returns status 200 if service is healthy
B. Create an endpoint that deletes all data when called
C. Create an endpoint that returns service logs only
D. Create an endpoint that restarts the service automatically

Solution

  1. Step 1: Identify typical health check endpoint behavior

    A health check endpoint usually responds with a simple status code like 200 OK to indicate the service is healthy.
  2. Step 2: Match this behavior with the options

    Option A, "Create an endpoint like /health that returns status 200 if service is healthy", matches this behavior. The other options perform unrelated or harmful actions.
  3. Final Answer:

    Create an endpoint like /health that returns status 200 if service is healthy -> Option A
  4. Quick Check:

    Health endpoint = status 200 OK [OK]
Hint: Health endpoint returns 200 OK if service is fine [OK]
Common Mistakes:
  • Confusing health check with data deletion
  • Expecting health check to return logs
  • Assuming health check restarts service
3. Consider this pseudocode for a health check endpoint:
function healthCheck() {
  if (database.isConnected() && cache.isAvailable()) {
    return { status: 200, message: 'Healthy' };
  } else {
    return { status: 503, message: 'Unhealthy' };
  }
}
What will the endpoint return if the database is connected but the cache is down?
medium
A. An error is thrown
B. { status: 200, message: 'Healthy' }
C. { status: 404, message: 'Not Found' }
D. { status: 503, message: 'Unhealthy' }

Solution

  1. Step 1: Analyze the condition in the healthCheck function

    The function returns healthy only if both database.isConnected() and cache.isAvailable() are true.
  2. Step 2: Evaluate the given scenario

    Database is connected (true), cache is down (false), so the condition is false and the function returns status 503 with 'Unhealthy'.
  3. Final Answer:

    { status: 503, message: 'Unhealthy' } -> Option D
  4. Quick Check:

    Both checks true = 200, else 503 [OK]
Hint: Both dependencies must be healthy for 200 OK [OK]
Common Mistakes:
  • Assuming partial health returns 200 OK
  • Confusing 503 with 404 status
  • Expecting an error instead of status response
4. A microservice health check endpoint is implemented as follows:
GET /health
Response: { "status": "ok" }
But monitoring tools report the service as unhealthy. What is the likely problem?
medium
A. The service is actually down and not responding
B. The endpoint does not return an HTTP status code 200
C. The endpoint URL should be /status instead of /health
D. The response body is missing the word 'healthy'

Solution

  1. Step 1: Understand monitoring tool expectations

    Most monitoring tools decide health from the HTTP status code, typically expecting 200, and do not interpret the response body by default.
  2. Step 2: Identify the issue with the current implementation

    The response body contains status "ok", but if the endpoint sends it with a status code other than 200, tools will still mark the service unhealthy.
  3. Final Answer:

    The endpoint does not return an HTTP status code 200 -> Option B
  4. Quick Check:

    Health check needs HTTP 200 status [OK]
Hint: Health check must return HTTP 200 status code [OK]
Common Mistakes:
  • Thinking response body text controls health status
  • Assuming endpoint URL name matters
  • Ignoring actual service availability
5. You design a microservice system with multiple services. To improve reliability, you want to implement health checks that also verify database and cache connectivity. Which approach best follows the health check pattern?
hard
A. Each service exposes a /health endpoint that always returns 200 regardless of dependency status
B. Only one central service exposes a health check endpoint for all services combined
C. Each service exposes a /health endpoint that returns 200 only if all dependencies (database, cache) are reachable; otherwise returns 503
D. Health checks are done by querying the database directly without service endpoints

Solution

  1. Step 1: Understand health check pattern for dependencies

    Health checks should verify the service and its critical dependencies like database and cache to ensure full functionality.
  2. Step 2: Evaluate the options for best practice

    Option C is the best fit: each service exposes its own /health endpoint, which returns 200 only if all critical dependencies (database, cache) are reachable and 503 otherwise. This lets monitors and load balancers detect a service that is running but cannot do useful work, and restore traffic once its dependencies recover.
  3. Final Answer:

    Each service exposes a /health endpoint that returns 200 only if all dependencies (database, cache) are reachable; otherwise returns 503 -> Option C
  4. Quick Check:

    Health check includes dependencies, returns 200 or 503 [OK]
Hint: Health check must verify all critical dependencies [OK]
Common Mistakes:
  • Ignoring dependency status in health check
  • Relying on a single central health endpoint
  • Checking dependencies outside service endpoints
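
The approach from question 5 (option C) can be sketched in a framework-agnostic way: run one check per dependency and return 200 only if all of them pass, otherwise 503. The function name `build_health_response` and the shape of the response body are assumptions for this example; the dependency checks are passed in as plain callables so that real database or cache pings can be substituted.

```python
from typing import Callable, Dict, Tuple


def build_health_response(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[dict, int]:
    """Run each dependency check and return (body, http_status)."""
    details = {}
    for name, check in checks.items():
        try:
            details[name] = "up" if check() else "down"
        except Exception:
            # A check that raises is reported as a failed dependency.
            details[name] = "down"
    all_up = all(state == "up" for state in details.values())
    body = {
        "status": "healthy" if all_up else "unhealthy",
        "dependencies": details,
    }
    return body, (200 if all_up else 503)
```

In the Flask example from the Code Example section, the `/health` view could call this function and `return jsonify(body), code`. Keeping each check fast, ideally with its own short timeout, matters because the monitor calls this endpoint frequently.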