
Why resilience prevents cascading failures in Microservices - Design It to Understand It

Design: Resilient Microservices System
This design focuses on the resilience mechanisms within a microservices architecture that prevent cascading failures. Out of scope: detailed business logic, UI design, and deployment automation.
Functional Requirements
FR1: Prevent cascading failures when one microservice fails
FR2: Ensure system continues to operate under partial failures
FR3: Provide fast recovery and isolation of failures
FR4: Support graceful degradation of features
FR5: Monitor and alert on failure patterns
Non-Functional Requirements
NFR1: Handle up to 10,000 concurrent requests
NFR2: API response latency p99 under 300ms
NFR3: Availability target of 99.9% uptime (roughly 8.8 hours of downtime per year)
NFR4: Support eventual consistency where needed
NFR5: Use standard communication protocols (HTTP/gRPC)
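The downtime figure in NFR3 follows directly from the availability target. As a minimal sketch, assuming a 365-day year (the function name is illustrative, not from the source), the yearly budget can be computed as below. A 365-day year gives about 8.76 hours; a 365.25-day year gives about 8.77 hours.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def downtime_budget_hours(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {downtime_budget_hours(target):.2f} hours/year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the target should be agreed on before choosing how aggressive the resilience mechanisms need to be.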
Key Components
API Gateway or Service Mesh for traffic control
Circuit Breakers to stop calls to failing services
Bulkheads to isolate resource usage per service
Retries with exponential backoff
Timeouts to avoid waiting indefinitely
Fallback handlers for degraded responses
Monitoring and alerting tools
Design Patterns
Circuit Breaker pattern
Bulkhead pattern
Timeout and Retry pattern
Fallback pattern
Health Checks and Heartbeats
Backpressure and Rate Limiting
Reference Architecture
Client
  |
  v
API Gateway / Service Mesh
  |
  v
+--------------------+      +--------------------+      +--------------------+
| Microservice A     | ---> | Microservice B     | ---> | Microservice C     |
| (with Circuit      |      | (with Circuit      |      | (with Circuit      |
| Breaker, Bulkhead) |      | Breaker, Bulkhead) |      | Breaker, Bulkhead) |
+--------------------+      +--------------------+      +--------------------+
          |                           |                           |
          v                           v                           v
      Database A                  Database B                  Database C

Monitoring & Alerting System

Components
API Gateway / Service Mesh
  Technology: Envoy, Istio, or NGINX
  Role: Route requests, enforce rate limits, and provide observability
Microservices
  Technology: Spring Boot, Node.js, or Go services
  Role: Implement business logic with resilience patterns built in
Circuit Breaker
  Technology: Resilience4j, Hystrix (in maintenance mode), or built-in libraries
  Role: Detect failing downstream services and stop calling them to prevent cascading failures
Bulkhead
  Technology: Thread pools, connection pools, or container resource limits
  Role: Isolate resources per dependency so one failure cannot exhaust shared resources
Retries with Backoff
  Technology: Custom retry logic or libraries
  Role: Retry transient failures with increasing delay to avoid overloading the target
Timeouts
  Technology: HTTP client timeouts, gRPC deadlines
  Role: Fail fast instead of waiting indefinitely on slow or failed services
Fallback Handlers
  Technology: Code-level fallback methods
  Role: Provide degraded but functional responses when services fail
Monitoring & Alerting
  Technology: Prometheus, Grafana, ELK stack, PagerDuty
  Role: Track service health, detect anomalies, and alert operators
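To make the bulkhead component concrete, here is a minimal sketch that caps the number of concurrent calls to one dependency with a semaphore. The `Bulkhead` and `BulkheadFullError` names are hypothetical and not taken from any library; a production system would more likely use a dedicated thread pool, a connection pool limit, or a resilience library.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the bulkhead has no free slot (hypothetical name)."""


class Bulkhead:
    """Caps concurrent calls to a single dependency."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so a slow dependency
        # cannot pile up waiting callers in the calling service.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


if __name__ == "__main__":
    service_b_bulkhead = Bulkhead(max_concurrent=10)
    print(service_b_bulkhead.run(lambda: "response from B"))
```

Giving each downstream dependency its own `Bulkhead` instance is what keeps a stall in one dependency from consuming capacity needed for calls to the others.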
Request Flow
1. Client sends request to API Gateway.
2. API Gateway routes request to Microservice A.
3. Microservice A calls Microservice B with circuit breaker enabled.
4. If Microservice B is healthy, it processes and returns response.
5. If Microservice B is failing, circuit breaker trips and Microservice A uses fallback.
6. Bulkheads ensure resource isolation so failure in Microservice B does not exhaust Microservice A's resources.
7. Retries with backoff are attempted for transient failures.
8. Timeouts ensure calls do not hang indefinitely.
9. Monitoring system collects metrics and triggers alerts on failure patterns.
10. This prevents failure in Microservice B from cascading to Microservice A and beyond.
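Steps 3 to 5 of this flow can be sketched as a small count-based circuit breaker with a fallback. This is an illustrative sketch under stated assumptions, not the API of Resilience4j or any other library: the class name, the parameters, and the "consecutive failures" rule are all assumed, and real implementations often use a sliding-window failure rate instead.

```python
import time


class CircuitBreaker:
    """Count-based breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock  # injectable so the breaker can be tested
        self.state = "CLOSED"
        self.failure_count = 0
        self.opened_at = 0.0

    def call(self, func, fallback):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return fallback()      # fail fast: no downstream call is made
            self.state = "HALF_OPEN"   # reset timeout elapsed: allow a trial call
        try:
            result = func()
        except Exception:  # a real breaker would only count selected errors
            self.failure_count += 1
            if (self.state == "HALF_OPEN"
                    or self.failure_count >= self.failure_threshold):
                self.state = "OPEN"
                self.opened_at = self.clock()
            return fallback()
        # Success closes the breaker and resets the failure counter.
        self.state = "CLOSED"
        self.failure_count = 0
        return result
```

In Microservice A, the call to Microservice B would be passed as `func` and the degraded response (for example a cached value) as `fallback`. While the breaker is open, Microservice B receives no traffic from this caller, which gives it room to recover.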
Database Schema
Entities
ServiceStatus (service_id, status, last_checked): tracks health per microservice
CircuitBreakerState (service_id, state, failure_count, last_failure_time): tracks circuit breaker state per service
RequestLog (request_id, service_id, timestamp, status, latency): records requests for monitoring and analysis
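The three entities can be sketched as plain data classes. The field names come from the schema above; the field types, the example status values, and the latency unit are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ServiceStatus:
    service_id: str
    status: str                 # e.g. "UP" or "DOWN" (values assumed)
    last_checked: datetime


@dataclass
class CircuitBreakerState:
    service_id: str
    state: str                  # e.g. "CLOSED", "OPEN", "HALF_OPEN" (assumed)
    failure_count: int
    last_failure_time: Optional[datetime]  # None until the first failure


@dataclass
class RequestLog:
    request_id: str
    service_id: str
    timestamp: datetime
    status: str                 # e.g. "OK" or "ERROR" (values assumed)
    latency: float              # unit assumed to be milliseconds
```

All three records are keyed by `service_id`, so health, breaker state, and request history for one microservice can be joined when analysing a failure.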
Scaling Discussion
Bottlenecks
Circuit breaker state management under high concurrency
Resource exhaustion if bulkheads are not properly sized
Increased latency due to retries and fallbacks
Monitoring system overload with high volume metrics
API Gateway becoming a single point of failure
Solutions
Use distributed circuit breaker implementations with consistent state sharing
Dynamically adjust bulkhead sizes based on load and resource availability
Limit retries and use adaptive backoff to reduce latency impact
Scale monitoring infrastructure horizontally and aggregate metrics efficiently
Deploy multiple API Gateway instances with load balancing and failover
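The "limit retries" solution can be sketched as a retry helper with a hard retry cap, a capped exponential delay, and random jitter. The function name and parameters are hypothetical, and "full jitter" (a random delay between zero and the current ceiling) is one common choice rather than something the source prescribes.

```python
import random
import time


def retry_with_backoff(func, max_retries=3, base_delay_s=0.1, max_delay_s=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call func; on failure, retry up to max_retries times with backoff."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:  # a real client would retry only transient errors
            if attempt >= max_retries:
                raise  # retry budget exhausted: surface the last error
            # The ceiling doubles per attempt but never exceeds max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Jitter spreads retries out so callers do not retry in lockstep.
            sleep(rng() * ceiling)
            attempt += 1
```

The cap on both the retry count and the delay bounds the worst-case added latency, which matters for the p99 target in NFR2.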
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying failure scenarios, 20 minutes designing the resilience architecture and explaining patterns, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing key points.
Explain how cascading failures happen and why they are dangerous
Describe resilience patterns like circuit breakers and bulkheads clearly
Show understanding of trade-offs between availability and consistency
Discuss monitoring and alerting as essential for early failure detection
Address scaling challenges and practical solutions

Practice

1. What is the main reason resilience techniques are used in microservices architectures?
easy
A. To increase the speed of all services regardless of failures
B. To make services use less memory
C. To reduce the number of services in the system
D. To prevent one service failure from causing other services to fail

Solution

  1. Step 1: Understand the purpose of resilience

    Resilience techniques help systems handle failures without spreading the problem to other parts.
  2. Step 2: Identify the effect on cascading failures

    By isolating failures, resilience prevents one failure from causing a chain reaction in other services.
  3. Final Answer:

    To prevent one service failure from causing other services to fail -> Option D
  4. Quick Check:

    Resilience prevents cascading failures = D [OK]
Hint: Resilience stops failure spread, not just speed or size [OK]
Common Mistakes:
  • Thinking resilience only improves speed
  • Confusing resilience with reducing service count
  • Assuming resilience saves memory
2. Assume a pseudocode API in which retry(n) takes a retry count and timeout(ms) takes a duration in milliseconds. Which of the following calls configures retry and timeout sensibly?
easy
A. callService().retry(1000).timeout(3)
B. callService().retry(3).timeout(1000)
C. callService().timeout(3).retry(1000)
D. callService().retry(0).timeout(0)

Solution

  1. Step 1: Understand retry and timeout order

    Retries specify how many times to try again; timeout is the max wait time in milliseconds.
  2. Step 2: Check option correctness

    callService().retry(3).timeout(1000) uses retry(3) and timeout(1000) correctly. The other options either swap the two values or use zero, which disables the mechanism.
  3. Final Answer:

    callService().retry(3).timeout(1000) -> Option B
  4. Quick Check:

    Correct retry and timeout syntax = B [OK]
Hint: Retry count is small integer; timeout is milliseconds [OK]
Common Mistakes:
  • Swapping retry and timeout values
  • Setting retry or timeout to zero, which disables resilience
  • Confusing units of timeout
3. Consider this pseudocode snippet for a microservice call with resilience:
response = callService().retry(2).timeout(500).execute()
If the service fails twice quickly and then succeeds on the third try, what will be the outcome?
medium
A. The call succeeds after two retries within timeout
B. The call never retries and returns failure
C. The call times out before any retry
D. The call fails immediately without retries

Solution

  1. Step 1: Analyze retry behavior

    Retry(2) means the system will try up to 3 times total (1 initial + 2 retries) if failures occur.
  2. Step 2: Consider timeout and success timing

    Timeout(500) means each try waits up to 500ms. If the third try succeeds within this time, the call succeeds.
  3. Final Answer:

    The call succeeds after two retries within timeout -> Option A
  4. Quick Check:

    Retries allow success after failures = A [OK]
Hint: Retries add attempts; timeout limits each try duration [OK]
Common Mistakes:
  • Assuming no retries happen
  • Confusing total timeout with per-try timeout
  • Thinking timeout cancels retries immediately
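The behaviour in question 3 can be reproduced with a small sketch. `ServiceCall` is a hypothetical fluent wrapper written to mirror the pseudocode `callService().retry(2).timeout(500).execute()`; it is not a real library API, and it assumes the timeout applies to each attempt separately, as the solution states.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class ServiceCall:
    """Hypothetical fluent wrapper: ServiceCall(f).retry(n).timeout(ms).execute()."""

    def __init__(self, func):
        self._func = func
        self._retries = 0
        self._timeout_ms = None

    def retry(self, count):
        self._retries = count
        return self

    def timeout(self, millis):
        self._timeout_ms = millis
        return self

    def execute(self):
        attempts = 1 + self._retries  # 1 initial try + N retries
        timeout_s = None if self._timeout_ms is None else self._timeout_ms / 1000
        last_error = None
        for _ in range(attempts):
            pool = ThreadPoolExecutor(max_workers=1)
            try:
                # The timeout bounds this attempt only, not the whole call.
                return pool.submit(self._func).result(timeout=timeout_s)
            except FutureTimeout:
                last_error = TimeoutError("attempt exceeded timeout")
            except Exception as exc:
                last_error = exc
            finally:
                # Do not wait for a hung attempt; its thread keeps running
                # in the background, which a real client would also cancel.
                pool.shutdown(wait=False)
        raise last_error


if __name__ == "__main__":
    calls = []

    def flaky_service():
        calls.append(1)
        if len(calls) <= 2:
            raise ConnectionError("quick failure")
        return "ok"

    response = ServiceCall(flaky_service).retry(2).timeout(500).execute()
    print(response, "after", len(calls), "attempts")
```

With `retry(2)` the third attempt is still allowed, so two quick failures followed by a success yield a successful call. With `retry(1)` the same service would exhaust its attempts and the last error would be raised.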
4. A microservice uses a circuit breaker to prevent cascading failures. The circuit breaker is set to open after 5 failures but it opens after only 2 failures. What is the likely cause?
medium
A. The failure count threshold is incorrectly configured
B. The circuit breaker is ignoring failures
C. The service is not failing at all
D. The circuit breaker is disabled

Solution

  1. Step 1: Understand circuit breaker failure threshold

    The circuit breaker opens after a configured number of failures to stop calls temporarily.
  2. Step 2: Analyze early opening

    If it opens after 2 failures instead of 5, the threshold setting is likely wrong or misread.
  3. Final Answer:

    The failure count threshold is incorrectly configured -> Option A
  4. Quick Check:

    Early circuit breaker open = A [OK]
Hint: Check config values when behavior differs from expectations [OK]
Common Mistakes:
  • Assuming circuit breaker ignores failures
  • Thinking service is healthy when breaker opens
  • Believing circuit breaker is disabled if it opens
5. You design a microservices system with multiple dependent services. To prevent cascading failures, which combination of resilience patterns is best to apply?
hard
A. No retries, no timeouts, and no bulkheads
B. Retries with long timeouts and no circuit breakers
C. Circuit breakers, bulkheads, and short timeouts
D. Retries with infinite timeout and no bulkheads

Solution

  1. Step 1: Identify resilience patterns that isolate failures

    Circuit breakers stop calls to failing services; bulkheads isolate failures to parts of the system; short timeouts prevent long waits.
  2. Step 2: Evaluate options for preventing cascading failures

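    The combination of circuit breakers, bulkheads, and short timeouts stops calls to failing services, contains resource usage, and bounds waiting time, which keeps the system stable and responsive. The other options omit at least one of these protections or use long or infinite timeouts.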
  3. Final Answer:

    Circuit breakers, bulkheads, and short timeouts -> Option C
  4. Quick Check:

    Best resilience combo isolates and limits failure impact = C [OK]
Hint: Use circuit breakers + bulkheads + short timeouts to isolate failures [OK]
Common Mistakes:
  • Using long or infinite timeouts causing delays
  • Skipping circuit breakers leading to cascading failures
  • Not isolating failures with bulkheads