
Graceful degradation in Microservices - System Design Guide

Problem Statement
When a microservice or its dependencies fail or become slow, the entire system can become unresponsive or crash, causing a poor user experience and potential data loss.
Solution
Graceful degradation allows the system to continue operating with reduced functionality instead of failing completely. It detects failures or slowdowns and switches to fallback behaviors, such as serving cached data or disabling non-critical features, to maintain core service availability.
Architecture
Client → API Gateway → Microservice → Fallback (cache / circuit breaker)

This diagram shows a client request flowing through an API Gateway to a microservice. If the microservice fails or responds slowly, the fallback mechanism activates, using a cache or circuit breaker to provide a degraded but available response.
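The "slow counts as failed" half of this flow can be sketched with a hard timeout around the dependency call. Everything here is a hypothetical stand-in: the slow service is simulated with a sleep, and the fallback payload is an assumed minimal response.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

FALLBACK = {"recommendations": [], "degraded": True}  # assumed minimal response

def slow_recommendation_service():
    time.sleep(2)  # simulated slow dependency; in production, an HTTP call
    return {"recommendations": ["a", "b"], "degraded": False}

_executor = ThreadPoolExecutor(max_workers=4)

def get_recommendations(timeout_s=0.1):
    future = _executor.submit(slow_recommendation_service)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Slow counts as failed: return a degraded response immediately
        return FALLBACK
```

Answering within the timeout budget keeps the gateway responsive even while the dependency hangs; the caller sees reduced functionality rather than a stalled request.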

Trade-offs
✓ Pros
Improves system availability by preventing total failure during partial outages.
Enhances user experience by providing partial functionality instead of errors.
Allows time for recovery or repair without disrupting all users.
✗ Cons
Requires additional development effort to implement fallback logic and detect failures.
May serve stale or incomplete data, which can confuse users or cause inconsistencies.
Complexity increases as fallback paths must be maintained and tested.
Use when your microservices handle critical user-facing features and you expect occasional failures or latency spikes, especially at scale above thousands of requests per second.
Avoid if your system requires strict consistency and cannot tolerate stale or partial data, or if your traffic is very low (under 100 requests per second) where simpler retry logic suffices.
Real World Examples
Netflix
Netflix uses graceful degradation to serve video streaming with reduced quality or cached metadata when some backend services are slow or down, ensuring continuous playback.
Amazon
Amazon degrades non-essential features like recommendations or reviews during high load or failures, so customers can still complete purchases.
Uber
Uber degrades map features or surge pricing calculations temporarily when related microservices fail, allowing ride requests to continue.
Code Example
The before code calls the profile service directly and fails outright if the service is down. The after code keeps a small fallback cache that is refreshed on every successful call; if the service call fails, it serves the last known data, enabling graceful degradation. (`call_profile_service` and `ServiceUnavailableError` are assumed to come from the surrounding service client.)
### Before: No graceful degradation

def get_user_profile(user_id):
    # Direct call to microservice; any outage propagates to the caller
    response = call_profile_service(user_id)
    return response.data


### After: With graceful degradation

_profile_cache = {}  # last known good profile per user

def get_user_profile(user_id):
    try:
        response = call_profile_service(user_id)
        _profile_cache[user_id] = response.data  # refresh fallback copy on success
        return response.data
    except ServiceUnavailableError:
        # Fallback: serve the last known profile, or a minimal default
        return _profile_cache.get(user_id, {"user_id": user_id, "degraded": True})

Alternatives
Circuit Breaker
Circuit breaker stops calls to failing services to prevent cascading failures but does not provide fallback content itself.
Use when: Choose circuit breaker when you want to isolate failures quickly but fallback content is not needed or handled separately.
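A minimal circuit breaker can be sketched as a wrapper that counts consecutive failures and, once a threshold is crossed, fails fast for a cooldown period instead of calling the broken dependency. The class, thresholds, and error types below are illustrative assumptions, not a production implementation.

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive errors;
    reject calls until `reset_s` has elapsed, then allow a trial call."""

    def __init__(self, max_failures=3, reset_s=30.0):
        self.max_failures = max_failures
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Note that the breaker only isolates the failure; pairing it with a fallback (as in the code example above) is what turns fail-fast behavior into graceful degradation.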
Bulkhead
Bulkhead isolates resources per service or feature to contain failures, rather than degrading functionality.
Use when: Choose bulkhead when you want to prevent failure spread but maintain full functionality within isolated partitions.
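One way to sketch a bulkhead is a bounded semaphore that caps concurrent calls into a single dependency, so a slow service cannot exhaust the shared thread pool. The class name and rejection behavior below are assumptions for illustration.

```python
import threading

class Bulkhead:
    """Cap concurrent calls into one dependency; reject overflow
    immediately rather than queueing and tying up shared threads."""

    def __init__(self, max_concurrent=5):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

Callers inside a partition keep full functionality; only overflow traffic to the saturated dependency is rejected, which is the key difference from degrading everyone.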
Retry with Exponential Backoff
Retries failed requests instead of degrading functionality, aiming to recover from transient errors.
Use when: Choose retries when failures are expected to be short-lived and full functionality is critical.
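A retry loop with exponential backoff and jitter can be sketched as follows; the function name, the choice of `ConnectionError` as the transient error, and the delay parameters are illustrative assumptions.

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_s=0.1, cap_s=2.0):
    """Retry transient failures, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = min(cap_s, base_s * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herds
```

Unlike graceful degradation, a retry that exhausts its attempts still fails; retries buy time for transient errors rather than offering reduced functionality.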
Summary
Graceful degradation helps microservices keep working with limited features during failures.
It improves user experience by avoiding total outages and providing fallback content.
It requires careful design to balance availability, data freshness, and complexity.