Microservices · System Design · ~25 mins

Graceful degradation in Microservices - System Design Exercise

Design: Microservices System with Graceful Degradation
This design focuses on how microservices handle partial failures and degrade gracefully. It excludes the detailed business logic of each individual microservice.
Functional Requirements
FR1: The system should continue to operate with reduced functionality when some services fail.
FR2: Users should receive meaningful responses even if some features are temporarily unavailable.
FR3: The system must detect failures quickly and switch to fallback modes.
FR4: Critical services must have higher availability and degrade less.
FR5: The system should log degraded states for monitoring and alerting.
Non-Functional Requirements
NFR1: Handle up to 50,000 concurrent users.
NFR2: API response latency p99 should be under 300ms under normal conditions.
NFR3: System availability target is 99.9% uptime.
NFR4: Degraded mode responses should not exceed 500ms latency.
NFR5: Fallback mechanisms must not cause cascading failures.
Key Components
API Gateway with fallback routing
Service registry and health checks
Circuit breakers and bulkheads
Cache layers for fallback data
Monitoring and alerting systems
Design Patterns
Circuit Breaker pattern
Bulkhead isolation
Fallback and default responses
Timeouts and retries
Health check and service discovery
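The first pattern in the list above, the circuit breaker, can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation (libraries like Resilience4j add sliding windows, half-open probe limits, and metrics); the class name, threshold, and timeout values here are assumptions chosen for clarity.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN after repeated
    failures, then effectively HALF_OPEN after a cooldown,
    allowing a probe call to test recovery."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is CLOSED

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # OPEN: fail fast instead of hammering a sick service.
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: fall through and allow one probe call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip to OPEN
            raise
        # A success resets the breaker to CLOSED.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while OPEN is what keeps a slow dependency from tying up the caller's threads and cascading upstream.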
Reference Architecture
          +-------------------+
          |    API Gateway    |
          |  (with fallback)  |
          +---------+---------+
                    |
    +---------------+----------------+
    |               |                |
+---v---+       +---v---+        +---v---+
|Service|       |Service|        |Service|
|  A    |       |  B    |        |  C    |
+---+---+       +---+---+        +---+---+
    |               |                |
+---v---+       +---v---+        +---v---+
| Cache |       | Cache |        | Cache |
+-------+       +-------+        +-------+

Legend:
- API Gateway routes requests and applies fallback logic.
- Each service has its own cache for fallback data.
- Circuit breakers protect services from cascading failures.
- Monitoring tracks service health and degradation states.
Components
API Gateway
Nginx or Kong
Routes requests to microservices and provides fallback responses when services are down.
Microservices (A, B, C)
Spring Boot / Node.js / Go
Provide business functionality; designed to fail independently.
Cache Layer
Redis or Memcached
Stores recent successful responses to serve as fallback data during service failures.
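The "cache recent successes, serve them during failures" behavior can be sketched as follows. A plain dict stands in for Redis here, and the TTL value and function names are illustrative assumptions; note that in degraded mode the sketch deliberately accepts stale entries rather than returning an error.

```python
import time

class FallbackCache:
    """In-memory stand-in for Redis/Memcached; ttl marks when an
    entry stops being considered fresh (value is an assumption)."""

    def __init__(self, ttl=300.0):
        self.ttl = ttl
        self.store = {}  # key -> (value, stored_at)

    def put(self, key, value):
        self.store[key] = (value, time.monotonic())

def fetch_with_fallback(cache, key, call_service):
    """Try the live service; on failure, serve the last cached
    response (possibly stale) or a default message."""
    try:
        value = call_service()
        cache.put(key, value)          # refresh fallback data on success
        return value, "live"
    except Exception:
        entry = cache.store.get(key)   # accept stale data in degraded mode
        if entry is not None:
            return entry[0], "cached"
        return {"error": "feature temporarily unavailable"}, "default"
```

Returning the source tag ("live" / "cached" / "default") alongside the payload is one way to satisfy FR5: the gateway can log every non-"live" response as a degraded-state event.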
Circuit Breaker
Resilience4j / Hystrix
Detects failures and stops calls to failing services temporarily to prevent cascading failures.
Service Registry and Health Checks
Consul / Eureka
Tracks service availability and health status for routing and monitoring.
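A registry like Consul or Eureka polls each service's health endpoint and marks it healthy or down. The core loop can be sketched as below; `probe` stands in for an HTTP health-check call, and all names here are assumptions for illustration.

```python
import time

class ServiceRegistry:
    """Tracks per-service health. `probe` is any callable that
    returns True when the service's health endpoint answers."""

    def __init__(self):
        # name -> {"probe": fn, "status": str, "last_checked": float}
        self.services = {}

    def register(self, name, probe):
        self.services[name] = {"probe": probe,
                               "status": "unknown",
                               "last_checked": 0.0}

    def check_all(self):
        for entry in self.services.values():
            try:
                healthy = entry["probe"]()
            except Exception:
                healthy = False          # a timeout counts as unhealthy
            entry["status"] = "healthy" if healthy else "down"
            entry["last_checked"] = time.monotonic()

    def healthy_services(self):
        return [n for n, e in self.services.items()
                if e["status"] == "healthy"]
```

The gateway would consult `healthy_services()` before routing, so traffic never reaches an instance that just failed its check.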
Monitoring and Alerting
Prometheus + Grafana + Alertmanager
Monitors service health and degradation states; alerts on failures.
Request Flow
1. Client sends request to API Gateway.
2. API Gateway checks service registry for healthy services.
3. API Gateway routes request to target microservice.
4. Microservice processes request and returns response.
5. Response is cached for fallback use.
6. If microservice is down or slow, circuit breaker trips.
7. API Gateway serves fallback response from cache or default message.
8. Monitoring system records degraded state and alerts if needed.
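The request flow above can be condensed into a single gateway-side handler. This sketch compresses the registry check, live call, fallback caching, and degraded-state logging into one function; the parameter shapes (a status dict, a plain dict cache, a list as the log sink) are simplifying assumptions.

```python
def handle_request(service, registry_status, call_service, cache, log):
    """Condensed steps 2-8: registry check, live call, cache on
    success, cache/default fallback plus degraded-state logging."""
    if registry_status.get(service) != "healthy":
        log.append(f"{service}: degraded (registry)")        # step 8
        return cache.get(service, {"error": "service unavailable"})
    try:
        response = call_service()                            # steps 3-4
    except Exception:
        log.append(f"{service}: degraded (call failed)")     # steps 6-8
        return cache.get(service, {"error": "service unavailable"})
    cache[service] = response                                # step 5
    return response
```

Even in this toy form, the key property holds: every path returns *something* to the client (FR2), and every non-live path leaves a log entry behind (FR5).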
Database Schema
Entities:
- ServiceStatus: service_id (PK), status (healthy | degraded | down), last_checked_timestamp
- CachedResponse: service_id (FK), endpoint, response_data, timestamp

Relationships:
- ServiceStatus tracks health per microservice.
- CachedResponse stores fallback data linked to services and endpoints.
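As one possible concrete rendering of these entities, here is illustrative DDL run through SQLite. The column types, table names, and the composite key on CachedResponse are assumptions; a real deployment might use Postgres or keep CachedResponse in Redis rather than a relational store.

```python
import sqlite3

# Illustrative schema for the two entities; SQLite types and the
# (service_id, endpoint) composite key are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE service_status (
    service_id             TEXT PRIMARY KEY,
    status                 TEXT NOT NULL
        CHECK (status IN ('healthy', 'degraded', 'down')),
    last_checked_timestamp TEXT NOT NULL
);

CREATE TABLE cached_response (
    service_id    TEXT NOT NULL REFERENCES service_status(service_id),
    endpoint      TEXT NOT NULL,
    response_data TEXT NOT NULL,
    timestamp     TEXT NOT NULL,
    PRIMARY KEY (service_id, endpoint)
);
""")
conn.execute(
    "INSERT INTO service_status VALUES ('svc-a', 'healthy', '2024-01-01T00:00:00Z')")
row = conn.execute(
    "SELECT status FROM service_status WHERE service_id = 'svc-a'").fetchone()
```

The CHECK constraint enforces the three-state health model from the entity description at the database level.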
Scaling Discussion
Bottlenecks
API Gateway becomes a single point of failure or bottleneck under high load.
Cache layer may become stale or overwhelmed with fallback data.
Circuit breaker thresholds may not adapt well to changing load patterns.
Monitoring system may generate too many alerts during widespread degradation.
Solutions
Deploy multiple API Gateway instances behind a load balancer for high availability.
Use distributed caching with eviction policies and TTL to keep fallback data fresh.
Implement adaptive circuit breaker settings based on real-time metrics.
Use alert aggregation and severity levels to reduce alert noise.
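The "adaptive circuit breaker settings" idea above usually means tripping on a rolling error *rate* rather than a fixed consecutive-failure count, so the threshold scales with traffic. A sketch, with window size and rate limits as illustrative assumptions:

```python
from collections import deque

class RollingErrorRate:
    """Trip decision based on the error rate over the last `window`
    calls instead of a fixed failure count. All thresholds here
    are illustrative defaults."""

    def __init__(self, window=100, max_rate=0.5, min_calls=10):
        self.outcomes = deque(maxlen=window)  # True = success
        self.max_rate = max_rate
        self.min_calls = min_calls

    def record(self, success):
        self.outcomes.append(success)

    def should_trip(self):
        if len(self.outcomes) < self.min_calls:
            return False  # too little data under low traffic
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) > self.max_rate
```

The `min_calls` guard matters: without it, two failures at 3 a.m. would trip a breaker that handles thousands of calls at peak without blinking.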
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing architecture and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Explain the importance of graceful degradation for user experience and system resilience.
Describe how circuit breakers and fallback caches prevent cascading failures.
Discuss trade-offs between consistency and availability during degradation.
Highlight monitoring and alerting to detect and respond to degraded states.
Show awareness of scaling challenges and mitigation strategies.