Graceful degradation in Microservices - System Design Guide

Problem Statement
When a microservice or its dependencies fail or become slow, the entire system can become unresponsive or crash, causing a poor user experience and potential data loss.
Solution
Graceful degradation allows the system to continue operating with reduced functionality instead of failing completely. It detects failures or slowdowns and switches to fallback behaviors, such as serving cached data or disabling non-critical features, to maintain core service availability.
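As a minimal sketch of "disabling non-critical features", the function below treats the product data as mandatory and the recommendations as optional. The names `build_product_page`, `fetch_product`, and `fetch_recommendations` are hypothetical and used only for illustration; a real service would also log the failure.

```python
def build_product_page(product_id, fetch_product, fetch_recommendations):
    """Assemble a page; only the core product data is mandatory."""
    page = {"product": fetch_product(product_id), "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except (ConnectionError, TimeoutError):
        # Non-critical feature failed: hide it and flag the response as degraded
        page["recommendations"] = []
        page["degraded"] = True
    return page


def broken_recommendations(product_id):
    # Stand-in for a dependency that is currently down
    raise ConnectionError("recommendation service is down")


page = build_product_page(
    42,
    fetch_product=lambda pid: {"id": pid, "name": "Kettle"},
    fetch_recommendations=broken_recommendations,
)
print(page)
```

The `degraded` flag lets the client decide how to present the reduced page, for example by hiding the recommendations panel.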
Architecture
[Diagram: Client → API Gateway → Microservice, with a Cache / Circuit Breaker fallback path]

This diagram shows a client request flowing through an API Gateway to a microservice. If the microservice fails or is slow, the fallback mechanism activates, possibly using cache or circuit breaker to provide degraded but available responses.
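The "slow" case in the flow above could be detected with a timeout. The sketch below is one possible approach, not the gateway's actual implementation: `call_with_fallback`, `slow_service`, and `cached_response` are hypothetical names, and the timeout values are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool; a timed-out call keeps running in its worker thread
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(service_call, fallback, timeout_s):
    """Return service_call() if it answers in time, otherwise fallback()."""
    future = _executor.submit(service_call)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Too slow: stop waiting and degrade
        return fallback()
    except Exception:
        # The service call itself raised an error
        return fallback()


def slow_service():
    time.sleep(0.3)  # simulates a latency spike
    return {"source": "live"}


def cached_response():
    return {"source": "cache", "stale": True}


result = call_with_fallback(slow_service, cached_response, timeout_s=0.05)
print(result)
```

Note that the timeout only stops the caller from waiting. The slow call still occupies a worker until it finishes, which is one reason timeouts are often paired with a circuit breaker or bulkhead.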

Trade-offs
✓ Pros
Improves system availability by preventing total failure during partial outages.
Enhances user experience by providing partial functionality instead of errors.
Allows time for recovery or repair without disrupting all users.
✗ Cons
Requires additional development effort to implement fallback logic and detect failures.
May serve stale or incomplete data, which can confuse users or cause inconsistencies.
Complexity increases as fallback paths must be maintained and tested.
Use when your microservices handle critical user-facing features and you expect occasional failures or latency spikes, especially at traffic levels of thousands of requests per second or more.
Avoid if your system requires strict consistency and cannot tolerate stale or partial data, or if your traffic is very low (under 100 requests per second) where simpler retry logic suffices.
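One way to limit the stale-data risk listed under Cons is to put an upper bound on how old a fallback value may be. The class below is a minimal sketch under that assumption. `StaleCache` and `max_age_s` are hypothetical names, and the clock is injectable only so the example can simulate time passing.

```python
import time


class StaleCache:
    """Keeps the last good value per key and refuses entries older than max_age_s."""

    def __init__(self, max_age_s, clock=time.monotonic):
        self.max_age_s = max_age_s
        self.clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def put(self, key, value):
        self._entries[key] = (self.clock(), value)

    def get(self, key):
        """Return (value, age_in_seconds), or None if missing or too old."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        age = self.clock() - stored_at
        if age > self.max_age_s:
            return None  # too stale to serve, even as a fallback
        return value, age


# Simulated clock so the example does not have to sleep
now = [0.0]
cache = StaleCache(max_age_s=60, clock=lambda: now[0])
cache.put("user:1", {"name": "Ada"})

now[0] = 30.0
fresh_enough = cache.get("user:1")  # still within the 60 s limit

now[0] = 61.0
too_old = cache.get("user:1")  # past the limit, so not served
```

Returning the age alongside the value lets the caller tell the user that the data may be out of date.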
Real World Examples
Netflix
Netflix uses graceful degradation to serve video streaming with reduced quality or cached metadata when some backend services are slow or down, ensuring continuous playback.
Amazon
Amazon degrades non-essential features like recommendations or reviews during high load or failures, so customers can still complete purchases.
Uber
Uber degrades map features or surge pricing calculations temporarily when related microservices fail, allowing ride requests to continue.
Code Example
The before code calls the profile service directly and fails if the service is down. The after code remembers the last successful response for each user in a small in-process cache. If the service call fails, it returns the last known profile (which may be stale), enabling graceful degradation. If nothing has been cached for that user yet, the error still propagates.
### Before: No graceful degradation

def get_user_profile(user_id):
    # Direct call to microservice
    response = call_profile_service(user_id)
    return response.data


### After: With graceful degradation
# Last successful profile per user; a real service would bound its size and age
_profile_cache = {}

def get_user_profile(user_id):
    try:
        response = call_profile_service(user_id)
    except ServiceUnavailableError:
        # Fallback to the last known (possibly stale) profile
        if user_id in _profile_cache:
            return _profile_cache[user_id]
        raise  # Nothing cached for this user: no degraded answer available
    # Success: remember the fresh profile for future fallbacks
    _profile_cache[user_id] = response.data
    return response.data

Alternatives
Circuit Breaker
Circuit breaker stops calls to failing services to prevent cascading failures but does not provide fallback content itself.
Use when: Choose circuit breaker when you want to isolate failures quickly but fallback content is not needed or handled separately.
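A circuit breaker can be sketched as a small state machine. The version below is illustrative only: the `CircuitBreaker` class, its parameter names, and the thresholds are assumptions, and the fallback is supplied by the caller because the breaker itself produces no content.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, service_call, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return fallback()  # open: skip the failing service entirely
            # Reset timeout elapsed: half-open, let one trial call through
        try:
            result = service_call()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result


calls = []


def failing():
    calls.append(1)
    raise ConnectionError("down")


now = [0.0]  # simulated clock
breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=30.0, clock=lambda: now[0])
results = [breaker.call(failing, lambda: "fallback") for _ in range(5)]

now[0] = 31.0  # after the reset timeout, a trial call is allowed
recovered = breaker.call(lambda: "live", lambda: "fallback")
```

In this run the failing service is contacted only twice. The remaining three requests get the fallback without touching it, which is what prevents cascading failures.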
Bulkhead
Bulkhead isolates resources per service or feature to contain failures, rather than degrading functionality.
Use when: Choose bulkhead when you want to prevent failure spread but maintain full functionality within isolated partitions.
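A bulkhead is often approximated with a per-dependency concurrency limit. The sketch below assumes a semaphore-based design. `Bulkhead`, `max_concurrent`, and `on_rejected` are hypothetical names chosen for the example.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust shared workers."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, service_call, on_rejected):
        if not self._slots.acquire(blocking=False):
            return on_rejected()  # partition is full: reject instead of queueing
        try:
            return service_call()
        finally:
            self._slots.release()


reviews = Bulkhead(max_concurrent=1)


def outer():
    # While this call holds the only slot, a second call is rejected
    return reviews.call(lambda: "inner ran", lambda: "rejected")


first = reviews.call(outer, lambda: "rejected")
second = reviews.call(lambda: "ok", lambda: "rejected")  # slot is free again
```

Each dependency gets its own `Bulkhead` instance, so a slow reviews service can fill only its own slots and leaves capacity for other features.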
Retry with Exponential Backoff
Retries failed requests instead of degrading functionality, aiming to recover from transient errors.
Use when: Choose retries when failures are expected to be short-lived and full functionality is critical.
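Retry with exponential backoff can be sketched as follows. The function name, the default delays, and the choice of full jitter are assumptions for illustration, and `sleep` is injectable only so the example runs without waiting.

```python
import random
import time


def retry_with_backoff(service_call, max_attempts=4, base_delay_s=0.1,
                       max_delay_s=2.0, sleep=time.sleep):
    """Retry on ConnectionError, doubling the delay ceiling after each failure."""
    for attempt in range(max_attempts):
        try:
            return service_call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller degrade or fail
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter spreads out retries


attempts = []


def flaky():
    # Fails twice with a transient error, then succeeds
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient")
    return "ok"


delays = []
result = retry_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
```

When the final attempt also fails, the exception propagates, and the caller can then apply a fallback, which is where retries and graceful degradation combine.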
Summary
Graceful degradation helps microservices keep working with limited features during failures.
It improves user experience by avoiding total outages and providing fallback content.
It requires careful design to balance availability, data freshness, and complexity.

Practice

1. What is the main goal of graceful degradation in microservices?
easy
A. To increase the number of microservices for better scaling
B. To immediately stop all services when one fails
C. To keep the system running with reduced functionality during failures
D. To replace microservices with a monolithic architecture

Solution

  1. Step 1: Understand the concept of graceful degradation

    Graceful degradation means the system continues to work even if some parts fail, but with limited features.
  2. Step 2: Identify the goal in microservices context

    In microservices, it ensures users still get responses, possibly simpler or fallback, instead of total failure.
  3. Final Answer:

    To keep the system running with reduced functionality during failures -> Option C
  4. Quick Check:

    Graceful degradation = reduced functionality during failure [OK]
Hint: Graceful degradation means partial working, not full stop [OK]
Common Mistakes:
  • Thinking graceful degradation means full system shutdown
  • Confusing graceful degradation with scaling techniques
  • Assuming it replaces microservices with monolith
2. Which of the following is a correct way to implement graceful degradation in a microservice call?
easy
A. Restart the entire microservice cluster immediately
B. Return an error and stop the entire request flow
C. Ignore the failure and return no response
D. Use a fallback response when the called service is unavailable

Solution

  1. Step 1: Identify how graceful degradation handles failures

    It uses fallback responses or simpler data to keep the system responsive.
  2. Step 2: Match the option that uses fallback

    Option D describes returning a fallback response when the called service is down, which is correct.
  3. Final Answer:

    Use a fallback response when the called service is unavailable -> Option D
  4. Quick Check:

    Fallback response = graceful degradation [OK]
Hint: Fallback response is key to graceful degradation [OK]
Common Mistakes:
  • Stopping entire request instead of fallback
  • Ignoring failure without response
  • Restarting cluster is not graceful degradation
3. Consider this pseudocode for a microservice call with graceful degradation:
response = callService()
if response == null:
    response = getCachedData()
return response

What will be returned if callService() fails?
medium
A. Cached data as fallback
B. Null value
C. An error message
D. Empty string

Solution

  1. Step 1: Analyze the code flow when callService() fails

    If callService() returns null (failure), the code fetches cached data as fallback.
  2. Step 2: Determine the returned value

    The fallback cached data is returned instead of null or error.
  3. Final Answer:

    Cached data as fallback -> Option A
  4. Quick Check:

    Fallback cached data returned on failure [OK]
Hint: Null response triggers fallback to cached data [OK]
Common Mistakes:
  • Assuming error message is returned
  • Thinking null is returned directly
  • Confusing empty string with fallback data
4. A microservice uses this code snippet for graceful degradation:
try {
  data = fetchFromService()
} catch (Exception e) {
  data = null
}
return data.toString()

What is the main problem with this code?
medium
A. It does not handle exceptions properly
B. It returns null.toString() causing a runtime error
C. It always returns an empty string
D. It retries the service call infinitely

Solution

  1. Step 1: Understand exception handling and return statement

    If fetchFromService() fails, data is set to null, then data.toString() is called.
  2. Step 2: Identify the error caused by calling toString() on null

    Calling toString() on null causes a runtime NullPointerException or similar error.
  3. Final Answer:

    It returns null.toString() causing a runtime error -> Option B
  4. Quick Check:

    Calling toString() on null causes error [OK]
Hint: Calling method on null causes runtime error [OK]
Common Mistakes:
  • Ignoring null check before toString()
  • Assuming exception is handled fully
  • Thinking it retries infinitely
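One way to repair the snippet in question 4 is to substitute a usable default before the value is used. The Python sketch below assumes a hypothetical `read_data` helper and `DEFAULT_DATA` constant. In Python, `str(None)` would not raise but would return the text "None", so the explicit check still matters.

```python
DEFAULT_DATA = "unavailable"  # hypothetical placeholder shown when the service is down


def read_data(fetch_from_service):
    try:
        data = fetch_from_service()
    except Exception:
        data = None
    if data is None:
        # Degrade to a safe default instead of using a missing value
        data = DEFAULT_DATA
    return str(data)


def broken_fetch():
    raise ConnectionError("service is down")


degraded = read_data(broken_fetch)
normal = read_data(lambda: 123)
```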
5. You design a microservice system where the payment service may fail. To apply graceful degradation, which approach is best?
hard
A. Return a simplified confirmation without payment details and log failure for retry
B. Block the entire order process until payment service recovers
C. Send an error response to the user immediately without fallback
D. Remove the payment service and process orders without payment

Solution

  1. Step 1: Understand graceful degradation for critical service failure

    When the payment service fails, the system should still respond with limited information rather than block or return an error.
  2. Step 2: Evaluate options for best graceful degradation

    Option A returns a simplified confirmation and logs the failure for retry, maintaining user experience and system reliability.
  3. Final Answer:

    Return a simplified confirmation without payment details and log failure for retry -> Option A
  4. Quick Check:

    Simplified response + retry = graceful degradation [OK]
Hint: Simplify response and log failure for retry [OK]
Common Mistakes:
  • Blocking entire process on failure
  • Sending immediate error without fallback
  • Removing critical service entirely
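The approach in option A could be sketched as follows. Everything here is hypothetical: `place_order`, the `charge` callable, the status strings, and the in-memory `retry_queue`, which stands in for a durable queue or log.

```python
from collections import deque

retry_queue = deque()  # stand-in for a durable queue of payments to retry


def place_order(order_id, amount, charge):
    try:
        payment_id = charge(order_id, amount)
    except ConnectionError:
        # Payment service is down: accept the order, record the payment for retry
        retry_queue.append((order_id, amount))
        return {"order_id": order_id, "status": "pending_payment"}
    return {"order_id": order_id, "status": "paid", "payment_id": payment_id}


def payment_down(order_id, amount):
    raise ConnectionError("payment service is down")


degraded = place_order("o-1", 25, payment_down)
normal = place_order("o-2", 40, lambda order_id, amount: "pay-" + order_id)
```

The degraded response states that payment is still pending, so the user is not told the charge succeeded when it has only been queued.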