
Health check pattern in Microservices - System Design Guide

Problem Statement
When a microservice fails or becomes unresponsive, other services or load balancers may continue sending requests to it, causing errors and degraded user experience. Without a way to verify if a service is healthy, the system cannot automatically detect failures or reroute traffic, leading to downtime and cascading failures.
Solution
The health check pattern solves this by having each service expose a simple endpoint that reports its current status. Other components periodically call this endpoint to verify if the service is alive and functioning. If the health check fails, the service is marked unhealthy and traffic is redirected away until it recovers.
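The polling step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a production monitor. The function names `probe` and `check_all`, the 2-second timeout, and the rule that only HTTP 200 counts as healthy are assumptions made for the example.

```python
import http.client
import urllib.request


def probe(url, timeout=2.0):
    """Return True if the health endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, DNS failures, and the
        # HTTPError that urllib raises for 4xx/5xx responses.
        return False


def check_all(instances, probe_fn=probe):
    """Probe every instance once and return a {url: is_healthy} mapping."""
    return {url: probe_fn(url) for url in instances}
```

A monitor would typically call `check_all` on a fixed interval and pass the results to whatever component routes traffic. The `probe_fn` parameter exists so the check can be swapped out, for example in tests.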
Architecture
[Diagram: Client, Load Balancer, Health Check Monitor]

This diagram shows a client sending requests through a load balancer to a microservice. A health check monitor periodically queries the microservice's health endpoint to determine its status and informs the load balancer to route traffic accordingly.
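To make the interaction between the monitor and the load balancer concrete, here is a small sketch of a balancer that only routes to instances currently marked healthy. The class name `HealthAwareBalancer`, its methods, and the round-robin strategy are hypothetical choices for illustration. Real load balancers implement this internally.

```python
import itertools


class HealthAwareBalancer:
    """Round-robin balancer that skips instances reported as unhealthy."""

    def __init__(self, instances):
        self._instances = list(instances)
        # Assumption: instances start out healthy until a check says otherwise.
        self._healthy = set(self._instances)
        self._counter = itertools.count()

    def report(self, instance, is_healthy):
        """Called by the monitor after each health check."""
        if is_healthy:
            self._healthy.add(instance)
        else:
            self._healthy.discard(instance)

    def next_instance(self):
        """Pick the next healthy instance, preserving the configured order."""
        candidates = [i for i in self._instances if i in self._healthy]
        if not candidates:
            raise RuntimeError("no healthy instances available")
        return candidates[next(self._counter) % len(candidates)]
```

When the monitor calls `report(instance, False)`, that instance drops out of rotation. A later `report(instance, True)` puts it back, which is the recovery path the diagram describes.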

Trade-offs
✓ Pros
Enables automatic detection of unhealthy services to improve system reliability.
Allows load balancers and orchestrators to reroute traffic away from failing instances.
Simple to implement with minimal overhead on services.
Supports graceful recovery and reduces downtime.
✗ Cons
Requires additional infrastructure or monitoring components to poll health endpoints.
A basic health check may miss some failure modes, such as degraded performance, because the endpoint can still respond while the service is slow.
Improper health check design can cause false positives or negatives, affecting routing.
Use when running multiple microservice instances behind load balancers or orchestrators, especially at scale above 100 requests per second or when high availability is critical.
Avoid in very simple or single-instance services where failure detection and rerouting are unnecessary or add complexity without benefit.
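One common way to reduce the false positives and negatives mentioned in the cons is to change an instance's state only after several consecutive check results agree. The sketch below illustrates that idea. The class name `HealthTracker` and the default thresholds of 3 failures and 2 successes are assumptions for the example, not recommended values.

```python
class HealthTracker:
    """Flip health state only after a run of consecutive contrary results."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that disagree with current state

    def record(self, check_passed):
        """Record one check result and return the (possibly updated) state."""
        if check_passed == self.healthy:
            # Result agrees with the current state, so reset the streak.
            self._streak = 0
        else:
            self._streak += 1
            needed = (
                self.unhealthy_threshold if self.healthy else self.healthy_threshold
            )
            if self._streak >= needed:
                self.healthy = check_passed
                self._streak = 0
        return self.healthy
```

With these defaults, a single failed check does not remove an instance from rotation, and a single passing check does not restore it. The cost is slower detection, so the thresholds trade stability against reaction time.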
Real World Examples
Netflix
Netflix uses health checks to monitor microservices in its streaming platform, enabling automatic failover and traffic rerouting to healthy instances to maintain uninterrupted playback.
Uber
Uber employs health checks in its microservice architecture to detect service failures quickly and prevent cascading outages during high-demand periods.
Amazon
Amazon's Elastic Load Balancing service on AWS uses health checks to route traffic only to healthy targets such as EC2 instances, ensuring reliable service delivery.
Code Example
The "before" code has no endpoint that reports service health, so external systems cannot verify whether the service is alive. The "after" code adds a '/health' endpoint that returns a simple JSON status with HTTP 200. Load balancers or monitors can call this endpoint to check whether the service is healthy and route traffic accordingly.
### Before: No health check endpoint
from flask import Flask
app = Flask(__name__)

@app.route('/')
def home():
    return 'Hello World'


### After: Adding a health check endpoint
from flask import Flask, jsonify
app = Flask(__name__)

@app.route('/')
def home():
    return 'Hello World'

@app.route('/health')
def health_check():
    # Simple health check returning service status
    status = {'status': 'healthy'}
    return jsonify(status), 200
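
The '/health' endpoint above always reports healthy as long as the process can answer. A possible extension, sketched here with the standard library only, is to run a few dependency checks and derive the response from them. The helper name `build_health_response`, the check names, and the use of HTTP 503 for an unhealthy result are assumptions for illustration.

```python
def build_health_response(checks):
    """Run each named check and return a (payload, http_status) pair.

    `checks` maps a name to a zero-argument callable that returns True
    when the dependency is usable.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A check that raises is treated as a failed dependency.
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    payload = {
        "status": "healthy" if healthy else "unhealthy",
        "checks": results,
    }
    return payload, (200 if healthy else 503)
```

Inside the Flask route, the result could be returned as `jsonify(payload), code` after unpacking `payload, code = build_health_response(checks)`. Checks should stay cheap and fast, since the endpoint is polled frequently.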
Alternatives
Circuit Breaker
A circuit breaker stops requests to a failing service after detecting repeated failures, whereas a health check proactively monitors service status.
Use when: you want to prevent cascading failures by stopping calls after repeated failures, especially for transient faults.
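For comparison with the health check code, a circuit breaker can be sketched as a wrapper around outgoing calls. This is a simplified illustration, not a reference implementation. The names `CircuitBreaker` and `CircuitOpenError`, the threshold of 3 failures, and the 30-second reset timeout are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Unlike a health check, this reacts to failures observed on real calls by the caller, so the two patterns are often used together.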
Service Mesh
A service mesh provides built-in health checking and traffic management at the network layer, moving health checks out of application code.
Use when: you need advanced traffic control and observability across many microservices.
Summary
The health check pattern keeps requests away from failed or unresponsive microservices by verifying their status regularly.
Each service exposes a simple endpoint that external systems poll to determine its health and reroute traffic accordingly.
The pattern improves system reliability and availability, especially in large-scale microservice architectures.