
Why resilience prevents cascading failures in Microservices - Why This Architecture

Problem Statement
When one microservice fails or slows down, it can cause other connected services to also fail or become slow, creating a chain reaction that brings down the entire system. This cascading failure happens because services keep waiting for responses or retrying endlessly, exhausting resources and causing widespread outages.
Solution
Resilience techniques add safeguards like timeouts, retries with limits, and fallback responses to each service. These controls stop failures from spreading by quickly detecting problems and isolating them, so one service's trouble doesn't overwhelm others. This keeps the system stable and responsive even when parts fail.
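As a rough illustration of "retries with limits" and "fallback responses", the sketch below wraps any callable in a bounded retry loop with exponential backoff. The names `call_with_retries` and `FlakyService` are hypothetical and exist only for this example; a per-call timeout is assumed to be enforced inside the operation itself (for instance through an HTTP client's timeout setting).

```python
import time


def call_with_retries(operation, max_retries=2, base_delay=0.01,
                      fallback=None, sleep=time.sleep):
    """Run `operation`; retry at most `max_retries` times, then return `fallback`."""
    for attempt in range(max_retries + 1):  # 1 initial try + max_retries retries
        try:
            return operation()
        except Exception:
            if attempt < max_retries:
                # Exponential backoff: wait longer before each new attempt so a
                # struggling dependency is not hammered with immediate retries.
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed: degrade gracefully instead of propagating the error.
    return fallback


class FlakyService:
    """Hypothetical dependency that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("service unavailable")
        return "ok"
```

The retry limit is what keeps a failing dependency from being retried endlessly, and the fallback gives the caller something usable once that limit is reached.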
Architecture
[Diagram: Service A calls Service B through a layer of timeout and retry logic]

This diagram shows three microservices calling each other in sequence, each protected by resilience mechanisms like timeouts, circuit breakers, and fallback responses to prevent failure from spreading.

Trade-offs
✓ Pros
Prevents one service failure from causing system-wide outages.
Improves overall system availability and user experience.
Allows graceful degradation by providing fallback responses.
Detects failures quickly to reduce wasted resources on retries.
✗ Cons
Adds complexity to service code and configuration.
May mask underlying problems if fallbacks are overused.
Requires careful tuning of timeouts and retry policies.
Use when your system has multiple dependent microservices with inter-service calls and you expect partial failures or network issues at scale above hundreds of requests per second.
Avoid if your system is a simple monolith or has very low traffic (under 100 requests per second), where the overhead of resilience mechanisms outweighs the benefits.
Real World Examples
Netflix
Netflix uses circuit breakers and fallback responses in its microservices to prevent cascading failures during high traffic spikes or partial outages, ensuring continuous streaming.
Amazon
Amazon applies resilience patterns like retries with exponential backoff and circuit breakers to isolate failures in its order processing microservices, preventing system-wide order delays.
Uber
Uber implements timeouts and fallback logic in its ride matching services to avoid cascading failures when some services become slow or unresponsive during peak demand.
Code Example
The before code calls another service without limits, risking long waits or failures spreading. The after code adds a timeout to stop waiting too long and a circuit breaker to stop calling the failing service after repeated errors, returning a fallback response instead.
### Before resilience (naive call without timeout or circuit breaker)
import requests

def call_service_b():
    response = requests.get('http://service-b/api/data')
    return response.json()


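### After applying resilience with timeout and circuit breaker
import requests
from circuitbreaker import circuit, CircuitBreakerError

# The breaker can only count failures it sees, so the decorated function
# lets exceptions propagate instead of catching them itself.
@circuit(failure_threshold=3, recovery_timeout=10)
def fetch_from_service_b():
    # Give up after 2 seconds instead of waiting indefinitely.
    response = requests.get('http://service-b/api/data', timeout=2)
    response.raise_for_status()
    return response.json()

def call_service_b():
    try:
        return fetch_from_service_b()
    except (requests.exceptions.RequestException, CircuitBreakerError):
        # Either the call failed or the breaker is open: degrade gracefully.
        return {'data': 'fallback response'}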
Alternatives
Bulkhead Isolation
Divides system resources into isolated pools to contain failures, rather than relying on timeouts and retries.
Use when: Choose when resource exhaustion is a major risk and you want to prevent one service from consuming all resources.
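One common way to approximate a bulkhead in application code is to cap the number of concurrent calls to each dependency. The sketch below does this with a semaphore; the `Bulkhead` class and its parameters are hypothetical names for this example, and rejecting excess calls immediately (rather than queueing them) is an assumed design choice.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency; extra calls fail fast."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation, fallback=None):
        # Non-blocking acquire: if every slot is taken, return the fallback
        # right away instead of tying up another caller thread.
        if not self._slots.acquire(blocking=False):
            return fallback
        try:
            return operation()
        finally:
            # Always free the slot, even when the operation raises.
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` instance means a slow dependency can exhaust only its own slots, not the threads needed to call healthy services.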
Load Shedding
Drops or rejects requests under high load to prevent overload, instead of trying to handle all requests with resilience.
Use when: Choose when protecting system stability under extreme load is more important than serving every request.
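A minimal sketch of load shedding on the receiving side might look like the following. It assumes requests carry a priority flag and that low-priority traffic is rejected before high-priority traffic; the `LoadShedder` class, its limits, and the two-tier priority scheme are illustrative assumptions, and the counter is not thread-safe as written.

```python
from dataclasses import dataclass


@dataclass
class LoadShedder:
    capacity: int            # max requests in flight for any traffic (assumed value)
    low_priority_limit: int  # low-priority traffic is shed above this level
    in_flight: int = 0

    def admit(self, high_priority: bool) -> bool:
        """Return True if the request may proceed, False if it is shed."""
        limit = self.capacity if high_priority else self.low_priority_limit
        if self.in_flight >= limit:
            return False  # caller would typically answer with HTTP 503
        self.in_flight += 1
        return True

    def done(self) -> None:
        """Mark one admitted request as finished."""
        self.in_flight -= 1
```

A rejected request costs almost nothing to answer, which is the point: the service spends its remaining capacity on work it can actually finish.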
Summary
Cascading failures happen when one service's problem spreads to others, causing system-wide outages.
Resilience techniques like timeouts, retries, circuit breakers, and fallbacks isolate failures and keep the system stable.
Applying resilience is essential in microservices to maintain availability and prevent widespread failures.

Practice

(1/5)
1. What is the main reason resilience techniques are used in microservices architectures?
easy
A. To increase the speed of all services regardless of failures
B. To make services use less memory
C. To reduce the number of services in the system
D. To prevent one service failure from causing other services to fail

Solution

  1. Step 1: Understand the purpose of resilience

    Resilience techniques help systems handle failures without spreading the problem to other parts.
  2. Step 2: Identify the effect on cascading failures

    By isolating failures, resilience prevents one failure from causing a chain reaction in other services.
  3. Final Answer:

    To prevent one service failure from causing other services to fail -> Option D
  4. Quick Check:

    Resilience prevents cascading failures = D [OK]
Hint: Resilience stops failure spread, not just speed or size [OK]
Common Mistakes:
  • Thinking resilience only improves speed
  • Confusing resilience with reducing service count
  • Assuming resilience saves memory
2. Which of the following is a correct resilience pattern syntax in a microservice call?
easy
A. callService().retry(1000).timeout(3)
B. callService().retry(3).timeout(1000)
C. callService().timeout(3).retry(1000)
D. callService().retry(0).timeout(0)

Solution

  1. Step 1: Understand retry and timeout order

    Retries specify how many times to try again; timeout is the max wait time in milliseconds.
  2. Step 2: Check option correctness

    callService().retry(3).timeout(1000) uses retry(3) and timeout(1000) correctly. The other options swap the values or use zero, which disables resilience.
  3. Final Answer:

    callService().retry(3).timeout(1000) -> Option B
  4. Quick Check:

    Correct retry and timeout syntax = B [OK]
Hint: Retry count is small integer; timeout is milliseconds [OK]
Common Mistakes:
  • Swapping retry and timeout values
  • Setting retry or timeout to zero, which disables resilience
  • Confusing units of timeout
3. Consider this pseudocode snippet for a microservice call with resilience:
response = callService().retry(2).timeout(500).execute()
If the service fails twice quickly and then succeeds on the third try, what will be the outcome?
medium
A. The call succeeds after two retries within timeout
B. The call never retries and returns failure
C. The call times out before any retry
D. The call fails immediately without retries

Solution

  1. Step 1: Analyze retry behavior

    Retry(2) means the system will try up to 3 times total (1 initial + 2 retries) if failures occur.
  2. Step 2: Consider timeout and success timing

    Timeout(500) means each try waits up to 500ms. If the third try succeeds within this time, the call succeeds.
  3. Final Answer:

    The call succeeds after two retries within timeout -> Option A
  4. Quick Check:

    Retries allow success after failures = A [OK]
Hint: Retries add attempts; timeout limits each try duration [OK]
Common Mistakes:
  • Assuming no retries happen
  • Confusing total timeout with per-try timeout
  • Thinking timeout cancels retries immediately
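The difference between a per-try timeout and the total time a caller may wait can be made concrete with a little arithmetic. The helper below is a hypothetical illustration, assuming (as in the solution above) that the timeout applies to each attempt separately; the optional `backoff_ms` parameter is an added assumption for callers that pause between attempts.

```python
def worst_case_latency_ms(retries, per_try_timeout_ms, backoff_ms=0):
    """Upper bound on caller wait time when every attempt runs to its timeout."""
    attempts = retries + 1  # 1 initial try + the configured retries
    return attempts * per_try_timeout_ms + retries * backoff_ms


# retry(2).timeout(500): three attempts of up to 500 ms each
print(worst_case_latency_ms(retries=2, per_try_timeout_ms=500))  # → 1500
```

This is why retry counts and timeouts need to be tuned together: each extra retry adds a full timeout to the worst-case wait seen by upstream callers.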
4. A microservice uses a circuit breaker to prevent cascading failures. The circuit breaker is set to open after 5 failures but it opens after only 2 failures. What is the likely cause?
medium
A. The failure count threshold is incorrectly configured
B. The circuit breaker is ignoring failures
C. The service is not failing at all
D. The circuit breaker is disabled

Solution

  1. Step 1: Understand circuit breaker failure threshold

    The circuit breaker opens after a configured number of failures to stop calls temporarily.
  2. Step 2: Analyze early opening

    If it opens after 2 failures instead of 5, the threshold setting is likely wrong or misread.
  3. Final Answer:

    The failure count threshold is incorrectly configured -> Option A
  4. Quick Check:

    Early circuit breaker open = A [OK]
Hint: Check config values when behavior differs from expectations [OK]
Common Mistakes:
  • Assuming circuit breaker ignores failures
  • Thinking service is healthy when breaker opens
  • Believing circuit breaker is disabled if it opens
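To see how the failure threshold drives the behavior described in this question, here is a minimal circuit breaker sketch. It is a simplified, single-threaded illustration rather than any particular library's implementation; the class name, the state names, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=10.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("circuit open; call skipped")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # Open (or re-open after a failed trial) once the count of
            # consecutive failures reaches the configured threshold.
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success resets the count and closes the circuit.
        self._failures = 0
        self._opened_at = None
        return result
```

With `failure_threshold=5` the breaker stays closed through four consecutive failures and opens on the fifth. A breaker that opens after two failures is behaving exactly as one constructed with `failure_threshold=2` would, which is why the configuration is the first thing to check.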
5. You design a microservices system with multiple dependent services. To prevent cascading failures, which combination of resilience patterns is best to apply?
hard
A. No retries, no timeouts, and no bulkheads
B. Retries with long timeouts and no circuit breakers
C. Circuit breakers, bulkheads, and short timeouts
D. Retries with infinite timeout and no bulkheads

Solution

  1. Step 1: Identify resilience patterns that isolate failures

    Circuit breakers stop calls to failing services; bulkheads isolate failures to parts of the system; short timeouts prevent long waits.
  2. Step 2: Evaluate options for preventing cascading failures

    Option C (circuit breakers, bulkheads, and short timeouts) combines these patterns effectively to keep the system stable and responsive.
  3. Final Answer:

    Circuit breakers, bulkheads, and short timeouts -> Option C
  4. Quick Check:

    Best resilience combo isolates and limits failure impact = C [OK]
Hint: Use circuit breakers + bulkheads + short timeouts to isolate failures [OK]
Common Mistakes:
  • Using long or infinite timeouts causing delays
  • Skipping circuit breakers leading to cascading failures
  • Not isolating failures with bulkheads