Graceful degradation in Microservices - System Design Exercise

Design: Microservices System with Graceful Degradation

The design focuses on how microservices handle partial failures and degrade gracefully. The detailed business logic of each microservice is out of scope.

Functional Requirements

FR1: The system should continue to operate with reduced functionality when some services fail.

FR2: Users should receive meaningful responses even if some features are temporarily unavailable.

FR3: The system must detect failures quickly and switch to fallback modes.

FR4: Critical services must have higher availability and degrade less.

FR5: The system should log degraded states for monitoring and alerting.

Non-Functional Requirements

NFR1: Handle up to 50,000 concurrent users.

NFR2: API response latency p99 should be under 300ms under normal conditions.

NFR3: System availability target is 99.9% uptime.

NFR4: Degraded mode responses should not exceed 500ms latency.

NFR5: Fallback mechanisms must not cause cascading failures.
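To make NFR3 concrete, the availability target can be converted into a downtime budget. The sketch below is plain arithmetic; the helper name `downtime_budget_minutes` and the choice of a 30-day month and a 365-day year are assumptions made for illustration.

```python
def downtime_budget_minutes(availability: float, period_hours: float) -> float:
    """Minutes of downtime allowed in a period for a given availability target."""
    return (1.0 - availability) * period_hours * 60.0


# 99.9% availability over a 30-day month and a 365-day year
monthly = downtime_budget_minutes(0.999, 30 * 24)
yearly = downtime_budget_minutes(0.999, 365 * 24)

print(f"{monthly:.1f} min/month, {yearly / 60:.2f} h/year")
# prints "43.2 min/month, 8.76 h/year"
```

A budget of roughly 43 minutes per month is small enough that a single unhandled dependency outage can consume it, which is one reason the design leans on fallbacks rather than on fast manual recovery.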

Key Components

API Gateway with fallback routing

Service registry and health checks

Circuit breakers and bulkheads

Cache layers for fallback data

Monitoring and alerting systems

Design Patterns

Circuit Breaker pattern

Bulkhead isolation

Fallback and default responses

Timeouts and retries

Health check and service discovery
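Bulkhead isolation from the list above can be sketched as a cap on concurrent calls to a single dependency, so that one slow service cannot occupy every worker. This is a minimal illustration, not a library API: the class name `Bulkhead`, the parameter `max_concurrent`, and the reject-instead-of-queue policy are all assumptions.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class Bulkhead:
    """Caps concurrent calls to one dependency; rejects instead of queueing."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        # Non-blocking acquire: when the compartment is full, degrade at once
        # rather than letting callers pile up behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return fn()  # exceptions from fn propagate to the caller
        finally:
            self._slots.release()


bulkhead = Bulkhead(max_concurrent=1)


def outer() -> str:
    # While this call holds the only slot, a second call is rejected.
    return bulkhead.call(lambda: "live", lambda: "fallback")


nested_result = bulkhead.call(outer, lambda: "fallback")
later_result = bulkhead.call(lambda: "live", lambda: "fallback")
print(nested_result, later_result)  # prints "fallback live"
```

Rejecting immediately keeps degraded responses fast, which matters for the 500ms degraded-mode latency bound; a queueing bulkhead would trade that for fewer rejections.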

Reference Architecture

          +-------------------+
          |    API Gateway    |
          |  (with fallback)  |
          +---------+---------+
                    |
    +---------------+----------------+
    |               |                |
+---v---+       +---v---+        +---v---+
|Service|       |Service|        |Service|
|  A    |       |  B    |        |  C    |
+---+---+       +---+---+        +---+---+
    |               |                |
+---v---+       +---v---+        +---v---+
| Cache |       | Cache |        | Cache |
+-------+       +-------+        +-------+

Legend:
- API Gateway routes requests and applies fallback logic.
- Each service has its own cache for fallback data.
- Circuit breakers protect services from cascading failures.
- Monitoring tracks service health and degradation states.

Components

API Gateway (Nginx or Kong): Routes requests to microservices and provides fallback responses when services are down.

Microservices A, B, C (Spring Boot / Node.js / Go): Provide business functionality; designed to fail independently.

Cache Layer (Redis or Memcached): Stores recent successful responses to serve as fallback data during service failures.

Circuit Breaker (Resilience4j / Hystrix): Detects failures and temporarily stops calls to failing services to prevent cascading failures.

Service Registry and Health Checks (Consul / Eureka): Tracks service availability and health status for routing and monitoring.

Monitoring and Alerting (Prometheus + Grafana + Alertmanager): Monitors service health and degradation states; alerts on failures.
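The circuit breaker component can be illustrated with a small state machine (closed, open, half-open). This is a hand-written sketch and not the Resilience4j or Hystrix API; the class name, the parameters `failure_threshold` and `reset_timeout_s`, and the injectable `clock` are assumptions, and the code is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"  # timeout elapsed: one trial call is allowed
        return "open"

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            return fallback()  # short-circuit: do not touch the failing service
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens it.
            if state == "half_open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def failing_service() -> str:
    raise RuntimeError("service down")


now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=10.0, clock=lambda: now[0])
breaker.call(failing_service, lambda: "fallback")
breaker.call(failing_service, lambda: "fallback")
state_after_failures = breaker.state  # "open"
now[0] = 10.0
recovered = breaker.call(lambda: "live", lambda: "fallback")  # trial call succeeds
print(state_after_failures, recovered, breaker.state)  # prints "open live closed"
```

While the breaker is open, callers receive the fallback without waiting on a timeout, which both protects the struggling service and keeps degraded responses fast.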

Request Flow

1. Client sends request to API Gateway.

2. API Gateway checks service registry for healthy services.

3. API Gateway routes request to target microservice.

4. Microservice processes request and returns response.

5. Response is cached for fallback use.

6. If microservice is down or slow, circuit breaker trips.

7. API Gateway serves fallback response from cache or default message.

8. Monitoring system records degraded state and alerts if needed.
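The request flow above can be sketched as a single gateway handler. This is a simplified illustration under several assumptions: `handle_request`, `DEFAULT_BODY`, and the response dictionary shape are hypothetical, the cache is an in-memory dict standing in for Redis or Memcached, and the registry lookup and circuit breaker (steps 2 and 6) are reduced to a callable that either returns or raises.

```python
import logging
from typing import Callable, Dict, Tuple

logger = logging.getLogger("gateway")

DEFAULT_BODY = "This feature is temporarily unavailable."


def handle_request(service: str, endpoint: str,
                   call_service: Callable[[], str],
                   cache: Dict[Tuple[str, str], str]) -> dict:
    """Return a live response if possible, else cached data, else a default."""
    key = (service, endpoint)
    try:
        body = call_service()  # steps 3-4: route to the service, get a response
    except Exception as exc:
        # Step 8: record the degraded state so monitoring can alert on it.
        logger.warning("degraded: %s %s failed (%s)", service, endpoint, exc)
        # Step 7: prefer the last good response, then a default message.
        if key in cache:
            return {"body": cache[key], "source": "cache", "degraded": True}
        return {"body": DEFAULT_BODY, "source": "default", "degraded": True}
    cache[key] = body  # step 5: remember the response for fallback use
    return {"body": body, "source": "live", "degraded": False}


def down() -> str:
    raise TimeoutError("service B timed out")


cache: Dict[Tuple[str, str], str] = {}
first = handle_request("service-b", "/recommendations", down, cache)
second = handle_request("service-b", "/recommendations", lambda: "[1, 2, 3]", cache)
third = handle_request("service-b", "/recommendations", down, cache)
print(first["source"], second["source"], third["source"])
# prints "default live cache"
```

Returning an explicit `degraded` flag lets clients decide how to present partial results, which supports the requirement that users receive meaningful responses when a feature is unavailable.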

Database Schema

Entities:
- ServiceStatus: service_id (PK), status (healthy, degraded, down), last_checked_timestamp
- CachedResponse: service_id (FK), endpoint, response_data, timestamp

Relationships:
- ServiceStatus tracks health per microservice.
- CachedResponse stores fallback data linked to services and endpoints.
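One possible rendering of these entities as tables, using Python's built-in `sqlite3` module with an in-memory database. The column types, the snake_case table names, and the composite primary key on (service_id, endpoint) are assumptions; the source only names the fields and the PK/FK roles. The inserted rows are sample data for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE service_status (
    service_id             TEXT PRIMARY KEY,
    status                 TEXT NOT NULL
                           CHECK (status IN ('healthy', 'degraded', 'down')),
    last_checked_timestamp TEXT NOT NULL
);
CREATE TABLE cached_response (
    service_id    TEXT NOT NULL REFERENCES service_status(service_id),
    endpoint      TEXT NOT NULL,
    response_data TEXT NOT NULL,
    timestamp     TEXT NOT NULL,
    PRIMARY KEY (service_id, endpoint)  -- assumed: one fallback per endpoint
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(SCHEMA)

# Sample rows, for illustration only
conn.execute(
    "INSERT INTO service_status VALUES (?, ?, ?)",
    ("service-a", "degraded", "2024-01-01T00:00:00Z"),
)
conn.execute(
    "INSERT INTO cached_response VALUES (?, ?, ?, ?)",
    ("service-a", "/items", '{"items": []}', "2024-01-01T00:00:00Z"),
)

# Look up the fallback payload together with the owning service's health.
rows = conn.execute(
    "SELECT s.status, c.response_data "
    "FROM service_status s JOIN cached_response c USING (service_id)"
).fetchall()
print(rows)  # prints [('degraded', '{"items": []}')]
```

In the reference architecture the fallback data lives in Redis or Memcached rather than a relational store, so this schema is best read as a logical model of what is tracked.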

Scaling Discussion

Bottlenecks

API Gateway becomes a single point of failure or bottleneck under high load.

Cache layer may become stale or overwhelmed with fallback data.

Circuit breaker thresholds may not adapt well to changing load patterns.

Monitoring system may generate too many alerts during widespread degradation.

Solutions

Deploy multiple API Gateway instances behind a load balancer for high availability.

Use distributed caching with eviction policies and TTL to keep fallback data fresh.

Implement adaptive circuit breaker settings based on real-time metrics.

Use alert aggregation and severity levels to reduce alert noise.
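The TTL and eviction idea for fallback data can be sketched with a small in-process cache. This stands in for a distributed cache such as Redis and is only an illustration: the class name `TTLFallbackCache`, the parameters `ttl_s` and `max_entries` (assumed to be at least 1), and the evict-oldest-write policy are assumptions.

```python
import time
from typing import Callable, Dict, Generic, Optional, Tuple, TypeVar

V = TypeVar("V")


class TTLFallbackCache(Generic[V]):
    """Bounded cache whose entries expire after ttl_s seconds."""

    def __init__(self, ttl_s: float, max_entries: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_s = ttl_s
        self.max_entries = max_entries
        self._clock = clock
        self._entries: Dict[str, Tuple[float, V]] = {}

    def put(self, key: str, value: V) -> None:
        # When full, evict the entry that was written longest ago.
        if key not in self._entries and len(self._entries) >= self.max_entries:
            oldest = min(self._entries, key=lambda k: self._entries[k][0])
            del self._entries[oldest]
        self._entries[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[V]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl_s:
            del self._entries[key]  # too stale to serve, even as a fallback
            return None
        return value


now = [0.0]  # fake clock so the example does not need to sleep
fallbacks: TTLFallbackCache[str] = TTLFallbackCache(
    ttl_s=60.0, max_entries=2, clock=lambda: now[0])
fallbacks.put("service-a:/items", "cached items")
now[0] = 30.0
fresh = fallbacks.get("service-a:/items")    # within TTL
now[0] = 61.0
expired = fallbacks.get("service-a:/items")  # past TTL
print(fresh, expired)  # prints "cached items None"
```

The TTL bounds how stale a fallback can be, and the size cap keeps fallback data from crowding out everything else; when an entry has expired, the gateway would fall through to a default message.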

Interview Tips

Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing architecture and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.

Explain importance of graceful degradation for user experience and system resilience.

Describe how circuit breakers and fallback caches prevent cascading failures.

Discuss trade-offs between consistency and availability during degradation.

Highlight monitoring and alerting to detect and respond to degraded states.

Show awareness of scaling challenges and mitigation strategies.

Practice

1. What is the main goal of graceful degradation in microservices? (easy)

A. To increase the number of microservices for better scaling

B. To immediately stop all services when one fails

C. To keep the system running with reduced functionality during failures

D. To replace microservices with a monolithic architecture

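Solution

Step 1: Understand the concept of graceful degradation. The system continues to operate with reduced functionality when some services fail (FR1).

Step 2: Identify the goal in the microservices context. Users should still receive meaningful responses while some features are unavailable (FR2), which rules out stopping all services (B) and has nothing to do with adding services (A) or moving to a monolith (D).

Final Answer: C. To keep the system running with reduced functionality during failures.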
