
Graceful degradation in Microservices - System Design Exercise

Design: Microservices System with Graceful Degradation
This design focuses on how microservices handle partial failures and degrade gracefully. It excludes the detailed business logic of each microservice.
Functional Requirements
FR1: The system should continue to operate with reduced functionality when some services fail.
FR2: Users should receive meaningful responses even if some features are temporarily unavailable.
FR3: The system must detect failures quickly and switch to fallback modes.
FR4: Critical services must have higher availability and degrade less.
FR5: The system should log degraded states for monitoring and alerting.
Non-Functional Requirements
NFR1: Handle up to 50,000 concurrent users.
NFR2: API response latency p99 should be under 300ms under normal conditions.
NFR3: System availability target is 99.9% uptime.
NFR4: Degraded mode responses should not exceed 500ms latency.
NFR5: Fallback mechanisms must not cause cascading failures.
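To make NFR3 concrete, the availability target can be converted into a downtime budget. The sketch below is plain arithmetic; the function name and the choice of a 365-day year and a 30-day month are assumptions for illustration, not part of the exercise.

```python
def allowed_downtime_minutes(availability: float, period_days: float) -> float:
    """Return the minutes of downtime an availability target permits over period_days."""
    return (1.0 - availability) * period_days * 24 * 60


# 99.9% availability, as stated in NFR3.
yearly = allowed_downtime_minutes(0.999, 365)
monthly = allowed_downtime_minutes(0.999, 30)
print(f"per year: {yearly:.1f} min, per 30 days: {monthly:.1f} min")
```

At 99.9% this works out to roughly 525.6 minutes (under 9 hours) per year, or about 43 minutes per 30 days, which is the budget that degraded-but-available responses help protect.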
Key Components
API Gateway with fallback routing
Service registry and health checks
Circuit breakers and bulkheads
Cache layers for fallback data
Monitoring and alerting systems
Design Patterns
Circuit Breaker pattern
Bulkhead isolation
Fallback and default responses
Timeouts and retries
Health check and service discovery
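As an illustration of the "Timeouts and retries" and "Fallback and default responses" patterns together, here is a minimal sketch of bounded retries with exponential backoff. The helper `call_with_retries`, its parameters, and the flaky service are hypothetical; the sketch assumes the per-call timeout is enforced inside `fn` by whatever client is used.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    fallback: T,
    attempts: int = 3,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try fn up to `attempts` times, backing off exponentially, then give up."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return fallback  # retries exhausted: degrade instead of raising


# Usage with a hypothetical service that fails twice, then succeeds.
outcomes = iter([ConnectionError("down"), ConnectionError("down"), "fresh data"])
delays: List[float] = []


def flaky_service() -> str:
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


# Record the backoff delays instead of actually sleeping.
result = call_with_retries(flaky_service, fallback="default data", sleep=delays.append)
print(result, delays)
```

Keeping the number of attempts small matters for NFR5: unbounded retries against a struggling service can themselves cause cascading failures.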
Reference Architecture
          +-------------------+
          |    API Gateway    |
          |  (with fallback)  |
          +---------+---------+
                    |
    +---------------+----------------+
    |               |                |
+---v---+       +---v---+        +---v---+
|Service|       |Service|        |Service|
|  A    |       |  B    |        |  C    |
+---+---+       +---+---+        +---+---+
    |               |                |
+---v---+       +---v---+        +---v---+
| Cache |       | Cache |        | Cache |
+-------+       +-------+        +-------+

Legend:
- API Gateway routes requests and applies fallback logic.
- Each service has its own cache for fallback data.
- Circuit breakers protect services from cascading failures.
- Monitoring tracks service health and degradation states.
Components
API Gateway
Nginx or Kong
Routes requests to microservices and provides fallback responses when services are down.
Microservices (A, B, C)
Spring Boot / Node.js / Go
Provide business functionality; designed to fail independently.
Cache Layer
Redis or Memcached
Stores recent successful responses to serve as fallback data during service failures.
Circuit Breaker
Resilience4j / Hystrix
Detects failures and stops calls to failing services temporarily to prevent cascading failures.
Service Registry and Health Checks
Consul / Eureka
Tracks service availability and health status for routing and monitoring.
Monitoring and Alerting
Prometheus + Grafana + Alertmanager
Monitors service health and degradation states; alerts on failures.
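The circuit breaker component can be sketched as a small state machine. This is a hand-rolled illustration, not the Resilience4j or Hystrix API; the class name, thresholds, and injectable clock are assumptions, and the sketch is not thread-safe.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Minimal three-state breaker: closed -> open -> half-open -> closed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self.state == "open":
            return fallback()  # fail fast without touching the failing service
        try:
            result = fn()  # normal call when closed, trial call when half-open
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success: reset the breaker to closed.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately, which keeps degraded responses fast and gives the failing service time to recover.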
Request Flow
1. Client sends request to API Gateway.
2. API Gateway checks service registry for healthy services.
3. API Gateway routes request to target microservice.
4. Microservice processes request and returns response.
5. Response is cached for fallback use.
6. If microservice is down or slow, circuit breaker trips.
7. API Gateway serves fallback response from cache or default message.
8. Monitoring system records degraded state and alerts if needed.
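The request flow above can be sketched as a single gateway handler. This is a simplified illustration under stated assumptions: `handle_request`, `DEFAULT_MESSAGE`, and the `degraded` flag are hypothetical names, a plain dict stands in for the cache layer, and the registry lookup and circuit breaker are left out to keep the sketch short.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("gateway")

# Hypothetical default shown when no cached copy exists.
DEFAULT_MESSAGE = {"degraded": True, "message": "This feature is temporarily unavailable."}


def handle_request(
    endpoint: str,
    call_service: Callable[[str], dict],
    cache: Dict[str, dict],
) -> dict:
    try:
        response = call_service(endpoint)  # steps 3-4: route and process
    except Exception as exc:  # step 6: service is down or timed out
        logger.warning("degraded: %s failed (%s)", endpoint, exc)  # step 8
        cached = cache.get(endpoint)
        if cached is not None:
            return {**cached, "degraded": True}  # step 7: cached fallback
        return dict(DEFAULT_MESSAGE)  # step 7: default message
    cache[endpoint] = response  # step 5: keep the last good response
    return response
```

Marking fallback responses with a flag such as `degraded` lets clients show a notice to users, which supports FR2.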
Database Schema
Entities:
- ServiceStatus: service_id (PK), status (healthy, degraded, down), last_checked_timestamp
- CachedResponse: service_id (FK), endpoint, response_data, timestamp
Relationships:
- ServiceStatus tracks health per microservice.
- CachedResponse stores fallback data linked to services and endpoints.
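One possible rendering of these entities, using Python's built-in `sqlite3` module with an in-memory database. The column types, the `CHECK` constraint, the composite primary key on (service_id, endpoint), and the sample row are assumptions added for illustration; the exercise only names the fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
conn.executescript(
    """
    CREATE TABLE service_status (
        service_id TEXT PRIMARY KEY,
        status TEXT NOT NULL CHECK (status IN ('healthy', 'degraded', 'down')),
        last_checked_timestamp TEXT NOT NULL
    );
    CREATE TABLE cached_response (
        service_id TEXT NOT NULL REFERENCES service_status(service_id),
        endpoint TEXT NOT NULL,
        response_data TEXT NOT NULL,
        timestamp TEXT NOT NULL,
        PRIMARY KEY (service_id, endpoint)  -- assumed: one cached copy per endpoint
    );
    """
)

# Hypothetical sample data.
conn.execute(
    "INSERT INTO service_status VALUES (?, ?, ?)",
    ("service-a", "degraded", "2024-01-01T00:00:00Z"),
)
degraded = conn.execute(
    "SELECT service_id FROM service_status WHERE status != 'healthy'"
).fetchall()
print(degraded)
```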
Scaling Discussion
Bottlenecks
API Gateway becomes a single point of failure or bottleneck under high load.
Cache layer may become stale or overwhelmed with fallback data.
Circuit breaker thresholds may not adapt well to changing load patterns.
Monitoring system may generate too many alerts during widespread degradation.
Solutions
Deploy multiple API Gateway instances behind a load balancer for high availability.
Use distributed caching with eviction policies and TTL to keep fallback data fresh.
Implement adaptive circuit breaker settings based on real-time metrics.
Use alert aggregation and severity levels to reduce alert noise.
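The TTL idea from the second solution can be sketched with a small in-memory cache that refuses to serve entries older than a configured age. `FallbackCache` and its injectable clock are hypothetical; in Redis the same effect typically comes from setting an expiry on each key.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class FallbackCache:
    """In-memory stand-in for a TTL cache that bounds how stale fallback data can be."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        # Store the value together with the time it was written.
        self._entries[key] = (self.clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired: evict rather than serve stale data
            return None
        return value
```

Choosing the TTL is a trade-off: a longer TTL keeps fallback data available through longer outages, while a shorter one limits how outdated a degraded response can be.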
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing architecture and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Explain the importance of graceful degradation for user experience and system resilience.
Describe how circuit breakers and fallback caches prevent cascading failures.
Discuss trade-offs between consistency and availability during degradation.
Highlight monitoring and alerting to detect and respond to degraded states.
Show awareness of scaling challenges and mitigation strategies.

Practice

1. What is the main goal of graceful degradation in microservices?
easy
A. To increase the number of microservices for better scaling
B. To immediately stop all services when one fails
C. To keep the system running with reduced functionality during failures
D. To replace microservices with a monolithic architecture

Solution

  1. Step 1: Understand the concept of graceful degradation

    Graceful degradation means the system continues to work even if some parts fail, but with limited features.
  2. Step 2: Identify the goal in microservices context

    In microservices, it ensures users still get responses, possibly simpler or fallback, instead of total failure.
  3. Final Answer:

    To keep the system running with reduced functionality during failures -> Option C
  4. Quick Check:

    Graceful degradation = reduced functionality during failure [OK]
Hint: Graceful degradation means partial working, not full stop [OK]
Common Mistakes:
  • Thinking graceful degradation means full system shutdown
  • Confusing graceful degradation with scaling techniques
  • Assuming it replaces microservices with monolith
2. Which of the following is a correct way to implement graceful degradation in a microservice call?
easy
A. Restart the entire microservice cluster immediately
B. Return an error and stop the entire request flow
C. Ignore the failure and return no response
D. Use a fallback response when the called service is unavailable

Solution

  1. Step 1: Identify how graceful degradation handles failures

    It uses fallback responses or simpler data to keep the system responsive.
  2. Step 2: Match the option that uses fallback

    Option D returns a fallback response when the called service is unavailable, which is exactly what graceful degradation calls for.
  3. Final Answer:

    Use a fallback response when the called service is unavailable -> Option D
  4. Quick Check:

    Fallback response = graceful degradation [OK]
Hint: Fallback response is key to graceful degradation [OK]
Common Mistakes:
  • Stopping entire request instead of fallback
  • Ignoring failure without response
  • Restarting cluster is not graceful degradation
3. Consider this pseudocode for a microservice call with graceful degradation:
response = callService()
if response == null:
    response = getCachedData()
return response

What will be returned if callService() fails?
medium
A. Cached data as fallback
B. Null value
C. An error message
D. Empty string

Solution

  1. Step 1: Analyze the code flow when callService() fails

    If callService() returns null (failure), the code fetches cached data as fallback.
  2. Step 2: Determine the returned value

    The fallback cached data is returned instead of null or error.
  3. Final Answer:

    Cached data as fallback -> Option A
  4. Quick Check:

    Fallback cached data returned on failure [OK]
Hint: Null response triggers fallback to cached data [OK]
Common Mistakes:
  • Assuming error message is returned
  • Thinking null is returned directly
  • Confusing empty string with fallback data
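A runnable rendering of the pseudocode from question 3, as a sketch. The function and parameter names are hypothetical, and treating `None` as the failure signal follows the pseudocode; many real clients raise an exception instead, which would need a `try`/`except` around the call.

```python
from typing import Callable, Optional


def get_with_fallback(
    call_service: Callable[[], Optional[str]],
    get_cached_data: Callable[[], str],
) -> str:
    response = call_service()
    if response is None:  # the service call failed
        response = get_cached_data()  # fall back to the cached copy
    return response


# A failing call returns the cached data; a successful call returns the live data.
print(get_with_fallback(lambda: None, lambda: "cached data"))
print(get_with_fallback(lambda: "live data", lambda: "cached data"))
```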
4. A microservice uses this code snippet for graceful degradation:
try {
  data = fetchFromService()
} catch (Exception e) {
  data = null
}
return data.toString()

What is the main problem with this code?
medium
A. It does not handle exceptions properly
B. It returns null.toString() causing a runtime error
C. It always returns an empty string
D. It retries the service call infinitely

Solution

  1. Step 1: Understand exception handling and return statement

    If fetchFromService() fails, data is set to null, then data.toString() is called.
  2. Step 2: Identify the error caused by calling toString() on null

    Calling toString() on null causes a runtime NullPointerException or similar error.
  3. Final Answer:

    It returns null.toString() causing a runtime error -> Option B
  4. Quick Check:

    Calling toString() on null causes error [OK]
Hint: Calling method on null causes runtime error [OK]
Common Mistakes:
  • Ignoring null check before toString()
  • Assuming exception is handled fully
  • Thinking it retries infinitely
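One way to repair the snippet from question 4 is to check for the missing value before using it. The sketch below is written in Python rather than the Java-like syntax of the question, and the names `fetch_as_text` and `fallback_text` are hypothetical.

```python
from typing import Any, Callable


def fetch_as_text(
    fetch_from_service: Callable[[], Any],
    fallback_text: str = "unavailable",
) -> str:
    try:
        data = fetch_from_service()
    except Exception:
        data = None  # remember that the call failed
    if data is None:
        return fallback_text  # degrade instead of calling a method on a missing value
    return str(data)


def broken_service() -> Any:
    raise ConnectionError("service down")


print(fetch_as_text(broken_service))
print(fetch_as_text(lambda: 42))
```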
5. You design a microservice system where the payment service may fail. To apply graceful degradation, which approach is best?
hard
A. Return a simplified confirmation without payment details and log failure for retry
B. Block the entire order process until payment service recovers
C. Send an error response to the user immediately without fallback
D. Remove the payment service and process orders without payment

Solution

  1. Step 1: Understand graceful degradation for critical service failure

    When the payment service fails, the system should still respond with limited information rather than block or return a bare error.
  2. Step 2: Evaluate options for best graceful degradation

    Option A returns a simplified confirmation and logs the failure for retry, which preserves the user experience while the payment is completed later.
  3. Final Answer:

    Return a simplified confirmation without payment details and log failure for retry -> Option A
  4. Quick Check:

    Simplified response + retry = graceful degradation [OK]
Hint: Simplify response and log failure for retry [OK]
Common Mistakes:
  • Blocking entire process on failure
  • Sending immediate error without fallback
  • Removing critical service entirely
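The approach in option A can be sketched as accepting the order, marking the payment as pending, and queuing it for a later retry. Everything here is hypothetical: `place_order`, the `charge` callable, the response fields, and the in-memory queue standing in for a durable one. A real system would also hold fulfilment until the payment succeeds.

```python
import logging
from collections import deque
from typing import Callable, Deque

logger = logging.getLogger("orders")


def place_order(order_id: str, charge: Callable[[str], str], retry_queue: Deque[str]) -> dict:
    try:
        receipt = charge(order_id)
    except Exception as exc:
        # Degrade: log the failure, queue the payment for retry, confirm receipt only.
        logger.warning("payment failed for %s (%s); queued for retry", order_id, exc)
        retry_queue.append(order_id)
        return {"order_id": order_id, "status": "received", "payment": "pending"}
    return {"order_id": order_id, "status": "confirmed", "payment": receipt}


def payment_down(order_id: str) -> str:
    raise TimeoutError("payment service unavailable")


pending: Deque[str] = deque()
print(place_order("order-1", payment_down, pending))
print(list(pending))
```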