
Graceful degradation in Microservices - System Design Guide

Problem Statement
When a microservice or its dependencies fail or become slow, the entire system can become unresponsive or crash, causing a poor user experience and potential data loss.
Solution
Graceful degradation allows the system to continue operating with reduced functionality instead of failing completely. It detects failures or slowdowns and switches to fallback behaviors, such as serving cached data or disabling non-critical features, to maintain core service availability.
Architecture
Client → API Gateway → Microservice → Fallback (cache / circuit breaker)

This diagram shows a client request flowing through an API Gateway to a microservice. If the microservice fails or responds slowly, the fallback mechanism activates, using a cache or circuit breaker to provide a degraded but available response.
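The "slow counts as failed" half of this flow can be sketched with a hard timeout around the dependency call. Everything here is a hypothetical stand-in: the slow service is simulated with a sleep, and the fallback payload is an assumed minimal response.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

FALLBACK = {"recommendations": [], "degraded": True}  # assumed minimal response

def slow_recommendation_service():
    time.sleep(2)  # simulated slow dependency; in production, an HTTP call
    return {"recommendations": ["a", "b"], "degraded": False}

_executor = ThreadPoolExecutor(max_workers=4)

def get_recommendations(timeout_s=0.1):
    future = _executor.submit(slow_recommendation_service)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Slow counts as failed: return a degraded response immediately
        return FALLBACK
```

Answering within the timeout budget keeps the gateway responsive even while the dependency hangs; the caller sees reduced functionality rather than a stalled request.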

Trade-offs
✓ Pros
Improves system availability by preventing total failure during partial outages.
Enhances user experience by providing partial functionality instead of errors.
Allows time for recovery or repair without disrupting all users.
✗ Cons
Requires additional development effort to implement fallback logic and detect failures.
May serve stale or incomplete data, which can confuse users or cause inconsistencies.
Complexity increases as fallback paths must be maintained and tested.
Use when your microservices handle critical user-facing features and you expect occasional failures or latency spikes, especially at scale above thousands of requests per second.
Avoid if your system requires strict consistency and cannot tolerate stale or partial data, or if your traffic is very low (under 100 requests per second) where simpler retry logic suffices.
Real World Examples
Netflix
Netflix uses graceful degradation to serve video streaming with reduced quality or cached metadata when some backend services are slow or down, ensuring continuous playback.
Amazon
Amazon degrades non-essential features like recommendations or reviews during high load or failures, so customers can still complete purchases.
Uber
Uber degrades map features or surge pricing calculations temporarily when related microservices fail, allowing ride requests to continue.
Code Example
The before code calls the profile service directly and fails outright if the service is down. The after code keeps a small fallback cache that is refreshed on every successful call; if the service call fails, it serves the last known data, enabling graceful degradation. (`call_profile_service` and `ServiceUnavailableError` are assumed to come from the surrounding service client.)
### Before: No graceful degradation

def get_user_profile(user_id):
    # Direct call to microservice; any outage propagates to the caller
    response = call_profile_service(user_id)
    return response.data


### After: With graceful degradation

_profile_cache = {}  # last known good profile per user

def get_user_profile(user_id):
    try:
        response = call_profile_service(user_id)
        _profile_cache[user_id] = response.data  # refresh fallback copy on success
        return response.data
    except ServiceUnavailableError:
        # Fallback: serve the last known profile, or a minimal default
        return _profile_cache.get(user_id, {"user_id": user_id, "degraded": True})

Alternatives
Circuit Breaker
Circuit breaker stops calls to failing services to prevent cascading failures but does not provide fallback content itself.
Use when: Choose circuit breaker when you want to isolate failures quickly but fallback content is not needed or handled separately.
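A minimal circuit breaker can be sketched as a wrapper that counts consecutive failures and, once a threshold is crossed, fails fast for a cooldown period instead of calling the broken dependency. The class, thresholds, and error types below are illustrative assumptions, not a production implementation.

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive errors;
    reject calls until `reset_s` has elapsed, then allow a trial call."""

    def __init__(self, max_failures=3, reset_s=30.0):
        self.max_failures = max_failures
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Note that the breaker only isolates the failure; pairing it with a fallback (as in the code example above) is what turns fail-fast behavior into graceful degradation.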
Bulkhead
Bulkhead isolates resources per service or feature to contain failures, rather than degrading functionality.
Use when: Choose bulkhead when you want to prevent failure spread but maintain full functionality within isolated partitions.
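One way to sketch a bulkhead is a bounded semaphore that caps concurrent calls into a single dependency, so a slow service cannot exhaust the shared thread pool. The class name and rejection behavior below are assumptions for illustration.

```python
import threading

class Bulkhead:
    """Cap concurrent calls into one dependency; reject overflow
    immediately rather than queueing and tying up shared threads."""

    def __init__(self, max_concurrent=5):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

Callers inside a partition keep full functionality; only overflow traffic to the saturated dependency is rejected, which is the key difference from degrading everyone.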
Retry with Exponential Backoff
Retries failed requests instead of degrading functionality, aiming to recover from transient errors.
Use when: Choose retries when failures are expected to be short-lived and full functionality is critical.
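A retry loop with exponential backoff and jitter can be sketched as follows; the function name, the choice of `ConnectionError` as the transient error, and the delay parameters are illustrative assumptions.

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_s=0.1, cap_s=2.0):
    """Retry transient failures, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = min(cap_s, base_s * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herds
```

Unlike graceful degradation, a retry that exhausts its attempts still fails; retries buy time for transient errors rather than offering reduced functionality.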
Summary
Graceful degradation helps microservices keep working with limited features during failures.
It improves user experience by avoiding total outages and providing fallback content.
It requires careful design to balance availability, data freshness, and complexity.