Availability checking in LLD - System Design Guide

Problem Statement
When a system depends on external services or components, failures or downtime in those dependencies can make the entire system unresponsive or cause it to crash. If the system does not check whether a dependency is available before using it, it may tie up threads and connections waiting on calls that will eventually time out, or fail unexpectedly, leading to a poor user experience and instability.
Solution
Availability checking means proactively verifying that a service or component is reachable and responsive before attempting to use it. This is typically done by sending lightweight requests or health checks and using the results to decide whether to proceed, retry, or fall back. This approach helps prevent cascading failures and improves system resilience by avoiding calls to unavailable parts.
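The proceed / retry / fall back decision can be sketched as a small helper. This is a minimal sketch, assuming the health check, the real call, and the fallback are passed in as plain functions; `call_if_available`, `retries`, and `delay_s` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_if_available(
    is_available: Callable[[], bool],
    call: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 2,
    delay_s: float = 0.1,
) -> T:
    """Run `call` only after a health check passes; otherwise fall back.

    The health check is repeated up to `retries` extra times, pausing
    `delay_s` seconds between checks, before giving up on the dependency.
    """
    for attempt in range(retries + 1):
        if is_available():
            return call()  # dependency looks healthy: proceed
        if attempt < retries:
            time.sleep(delay_s)  # wait briefly, then re-check
    return fallback()  # still unavailable: degrade gracefully
```

With a check that keeps failing, the helper returns the fallback result without ever invoking the real call, which is the "fail fast or use a fallback" behavior described above.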
Architecture
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Client System │──────▶│ Availability  │──────▶│ External      │
│               │       │ Checker       │       │ Service       │
└───────────────┘       └───────────────┘       └───────────────┘
         │                      │                       │
         │                      │◀──────────────────────┤
         │                      │   Health Check Result │
         │◀─────────────────────┤                       │

This diagram shows a client system sending requests to an availability checker, which performs health checks on an external service before allowing the client to proceed.
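The diagram assumes the external service exposes a health check endpoint that the checker can call. As a minimal sketch using only Python's standard library, the service side might answer `GET /health` with 200 when healthy and 503 otherwise, while `check_health` plays the Availability Checker role using `urllib`. The `/health` path, the JSON body, and the names `HealthHandler`, `start_health_server`, and `check_health` are hypothetical, not part of any particular framework.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health (hypothetical path): 200 if healthy, 503 if not."""

    healthy = True  # flip to False to simulate the service going down

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status = 200 if self.healthy else 503
        body = json.dumps({"status": "ok" if self.healthy else "unavailable"}).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def start_health_server(port: int = 0) -> ThreadingHTTPServer:
    """Serve in a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def check_health(url: str, timeout: float = 1.0) -> bool:
    """Checker side: True only for an HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # HTTPError (e.g. 503), refused connections, and timeouts all land here.
        return False
```

Keeping the endpoint cheap (no heavy downstream work) matters because every check adds latency to the caller.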

Trade-offs
✓ Pros
Prevents calls to unavailable services, reducing wasted resources.
Improves user experience by failing fast or using fallback options.
Helps isolate failures and avoid cascading system crashes.
✗ Cons
Adds latency due to extra health check requests.
Requires maintenance of health check logic and endpoints.
May produce false negatives (reporting a healthy service as unavailable) if health checks are too strict or the network is unstable.
Use when your system depends on external services or components that can fail or become unreachable, especially if those dependencies affect user-facing features or critical workflows.
Avoid if your system is fully self-contained with no external dependencies, or if the overhead of health checks outweighs the benefit at very low scale (e.g., under 100 requests per minute).
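One common way to soften the latency and false-negative cons listed above is to cache health results for a short time and only treat a dependency as down after several consecutive failed probes. This is a minimal sketch, assuming the probe is any zero-argument function returning a bool; `CachedAvailabilityChecker`, `ttl_s`, and `failure_threshold` are hypothetical names, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time
from typing import Callable, Optional


class CachedAvailabilityChecker:
    """Runs the health probe at most once per `ttl_s` seconds.

    A dependency is reported down only after `failure_threshold`
    consecutive failed probes, which damps one-off network blips.
    """

    def __init__(
        self,
        probe: Callable[[], bool],
        ttl_s: float = 5.0,
        failure_threshold: int = 2,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.probe = probe
        self.ttl_s = ttl_s
        self.failure_threshold = failure_threshold
        self.clock = clock
        self._consecutive_failures = 0
        self._last_checked: Optional[float] = None

    def is_available(self) -> bool:
        now = self.clock()
        stale = self._last_checked is None or now - self._last_checked >= self.ttl_s
        if stale:
            # Only hit the real probe when the cached result has expired.
            self._last_checked = now
            if self.probe():
                self._consecutive_failures = 0
            else:
                self._consecutive_failures += 1
        return self._consecutive_failures < self.failure_threshold
```

The trade-off is freshness: a new outage can go unnoticed for up to `ttl_s` seconds plus however many probes the threshold requires.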
Real World Examples
Netflix
Netflix uses availability checking to monitor microservices before routing user requests, ensuring only healthy services receive traffic to maintain smooth streaming.
Uber
Uber performs availability checks on payment gateways and mapping services to quickly detect failures and switch to fallback options, preventing user transaction failures.
Amazon
Amazon checks availability of inventory and shipping services before confirming orders, avoiding order processing delays or errors.
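The examples above do not describe how these companies implement their checks, so the following is only a generic, hypothetical sketch of the idea behind the Netflix example: send traffic only to instances whose health probe currently passes. `HealthyRouter` and the instance names are illustrative, not taken from any of these systems.

```python
import itertools
from typing import Callable, Dict, List


class HealthyRouter:
    """Round-robins requests across instances whose probes currently pass."""

    def __init__(self, probes: Dict[str, Callable[[], bool]]):
        self.probes = probes  # instance name -> health probe
        self._counter = itertools.count()

    def healthy_instances(self) -> List[str]:
        return [name for name, probe in self.probes.items() if probe()]

    def route(self) -> str:
        healthy = self.healthy_instances()
        if not healthy:
            raise RuntimeError("no healthy instances")
        # Rotate through whichever instances are healthy right now.
        return healthy[next(self._counter) % len(healthy)]
```

Probing every instance on every request is expensive; in practice the probe results would usually be cached or refreshed in the background.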
Code Example
The "before" code calls the external service directly without checking whether it is available, so an outage surfaces only as a slow or failed call. The "after" code adds an AvailabilityChecker that sends a health check request with a 1-second timeout before calling the service, and raises an error instead of calling the service when the check fails.
### Before Availability Checking (naive call)
class ExternalServiceClient:
    def get_data(self):
        # Directly call external service without checking
        response = self.call_service()
        return response

    def call_service(self):
        # Simulate service call
        return "data"


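### After Applying Availability Checking
import requests


class ServiceUnavailableError(Exception):
    """Raised when the health check reports the dependency as down."""


class AvailabilityChecker:
    def __init__(self, health_url, timeout=1.0):
        self.health_url = health_url
        self.timeout = timeout  # short timeout keeps the check cheap

    def is_available(self):
        # Connection errors, timeouts, and non-200 responses all count as "unavailable".
        try:
            response = requests.get(self.health_url, timeout=self.timeout)
            return response.status_code == 200
        except requests.RequestException:
            return False


class ExternalServiceClientWithCheck:
    def __init__(self, health_url):
        self.checker = AvailabilityChecker(health_url)

    def get_data(self):
        # Fail fast instead of waiting on a service that is known to be down.
        if not self.checker.is_available():
            raise ServiceUnavailableError("Service unavailable")
        return self.call_service()

    def call_service(self):
        # Simulate service call
        return "data"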
Alternatives
Circuit Breaker
A circuit breaker tracks the outcomes of real calls rather than sending separate health checks. After repeated failures it stops calling the failing service for a cooldown period, then lets a trial call through and resumes normal traffic if that call succeeds.
Use when: Choose circuit breaker when you want to prevent repeated calls to failing services and automatically recover without manual intervention.
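A minimal circuit breaker sketch, assuming a simple consecutive-failure count and a fixed cooldown; production libraries add more states, metrics, and thread safety, none of which this version has. `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout_s` are hypothetical names.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Minimal closed -> open -> half-open breaker around a callable."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; skipping call")
            # Cooldown elapsed: half-open, so let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```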
Retry Pattern
The retry pattern re-attempts a failed call several times, typically with a growing delay between attempts, without proactively checking availability first.
Use when: Choose retry when failures are transient and you expect the service to recover quickly without needing explicit health checks.
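A minimal retry sketch with exponential backoff and a little random jitter, assuming only `ConnectionError` and `TimeoutError` count as transient. `retry`, `attempts`, `base_delay_s`, and `retry_on` are hypothetical names, and `sleep` is injectable so tests do not have to wait.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient errors with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each time (base, 2*base, 4*base, ...) plus up to 10% jitter.
            delay = base_delay_s * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Errors outside `retry_on` propagate immediately, so permanent failures such as bad input are not retried.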
Summary
Availability checking helps prevent failures from spreading by verifying that external services are reachable before use.
It improves resilience by avoiding calls to down services and enabling fallback strategies.
This pattern is essential when systems depend on unreliable or variable external components.