Why resilience prevents cascading failures in Microservices - Architecture Impact

System Overview - Why resilience prevents cascading failures

This system demonstrates how resilience techniques in a microservices architecture prevent cascading failures. It shows how components like circuit breakers, retries, and fallback services help isolate failures and keep the system stable under stress.

Architecture Diagram
User
  |
  v
Load Balancer
  |
  v
API Gateway
  |
  +-------------------------+
  |                         |
  v                         v
Service A (with Circuit Breaker)   Service B (with Retry & Fallback)
  |                         |
  v                         v
Database A                Database B
  |
  v
Cache
Components
User (user): Initiates requests to the system
Load Balancer (load_balancer): Distributes incoming requests evenly to API Gateway instances
API Gateway (api_gateway): Routes requests to appropriate microservices and enforces resilience policies
Service A (with Circuit Breaker) (service): Handles business logic with circuit breaker to stop calls to failing downstream services
Service B (with Retry & Fallback) (service): Handles business logic with retry attempts and fallback responses on failure
Database A (database): Stores persistent data for Service A
Database B (database): Stores persistent data for Service B
Cache (cache): Speeds up data access and reduces load on databases
Fallback Service (service): Provides fallback responses when Service B fails
Request Flow - 13 Hops
User -> Load Balancer
Load Balancer -> API Gateway
API Gateway -> Service A (with Circuit Breaker)
Service A (with Circuit Breaker) -> Database A
Database A -> Cache
Cache -> Service A (with Circuit Breaker)
Service A (with Circuit Breaker) -> API Gateway
API Gateway -> Service B (with Retry & Fallback)
Service B (with Retry & Fallback) -> Database B
Service B (with Retry & Fallback) -> Fallback Service
Service B (with Retry & Fallback) -> API Gateway
API Gateway -> Load Balancer
Load Balancer -> User
Failure Scenario
Component Fails: Database B
Impact: Service B's database queries fail, triggering retries and eventually fallback responses. Without resilience, the failed calls could pile up and overload Service B and the API Gateway, causing cascading failures.
Mitigation: Retry logic caps the number of repeated attempts, the fallback service returns default responses, and circuit breakers stop further calls to the failing dependency. Together they isolate the failure and keep the rest of the system stable.
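The Database B scenario can be sketched in Python as a bounded retry loop that hands over to a fallback once the attempts are used up. This is a minimal illustration, not the system's actual code: call_with_retry_and_fallback, query_database_b, and fallback_service are hypothetical names, and the database is simulated as always failing.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry_and_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    max_retries: int = 2,
) -> T:
    """Try `primary` up to 1 + max_retries times, then use `fallback`."""
    for _attempt in range(1 + max_retries):
        try:
            return primary()
        except Exception:
            # A real client would catch only transient errors and back off
            # between attempts; both are left out to keep the sketch short.
            continue
    return fallback()


# Hypothetical stand-ins for Service B's dependencies.
attempts = {"count": 0}


def query_database_b() -> str:
    attempts["count"] += 1
    raise ConnectionError("Database B is unavailable")


def fallback_service() -> str:
    return "default response"


result = call_with_retry_and_fallback(query_database_b, fallback_service)
```

Because the number of attempts is capped, a failing Database B costs Service B a fixed amount of work per request instead of an unbounded retry storm.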
Architecture Quiz - 3 Questions
Test your understanding
Which component prevents Service A from repeatedly calling a failing database?
A. Cache
B. Load Balancer
C. Circuit Breaker in Service A
D. API Gateway
Design Principle
This architecture uses resilience patterns like circuit breakers, retries, and fallbacks to isolate failures and prevent them from spreading. Caches reduce load on databases, further stabilizing the system. These techniques together stop one failure from causing a chain reaction, keeping the system responsive and reliable.

Practice

1. What is the main reason resilience techniques are used in microservices architectures?
easy
A. To increase the speed of all services regardless of failures
B. To make services use less memory
C. To reduce the number of services in the system
D. To prevent one service failure from causing other services to fail

Solution

  1. Step 1: Understand the purpose of resilience

    Resilience techniques help systems handle failures without spreading the problem to other parts.
  2. Step 2: Identify the effect on cascading failures

    By isolating failures, resilience prevents one failure from causing a chain reaction in other services.
  3. Final Answer:

    To prevent one service failure from causing other services to fail -> Option D
  4. Quick Check:

    Resilience prevents cascading failures = D [OK]
Hint: Resilience stops failure spread, not just speed or size [OK]
Common Mistakes:
  • Thinking resilience only improves speed
  • Confusing resilience with reducing service count
  • Assuming resilience saves memory
2. Which of the following is a correct resilience pattern syntax in a microservice call?
easy
A. callService().retry(1000).timeout(3)
B. callService().retry(3).timeout(1000)
C. callService().timeout(3).retry(1000)
D. callService().retry(0).timeout(0)

Solution

  1. Step 1: Understand what each argument means

    retry(n) sets how many additional attempts are made after a failure; timeout(ms) sets the maximum wait per attempt in milliseconds.
  2. Step 2: Check option correctness

    callService().retry(3).timeout(1000) pairs a small retry count with a 1000 ms timeout. Options A and C swap the two values (1000 retries with a 3 ms timeout), and option D passes zero for both, which disables resilience.
  3. Final Answer:

    callService().retry(3).timeout(1000) -> Option B
  4. Quick Check:

    Correct retry and timeout syntax = B [OK]
Hint: Retry count is small integer; timeout is milliseconds [OK]
Common Mistakes:
  • Swapping retry and timeout values
  • Passing zero for the retry count or timeout, which disables resilience
  • Confusing units of timeout
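One way to catch the swapped-value mistake early is to validate the policy when it is built. The sketch below is an illustration in Python, not part of any real library: CallPolicy, is_valid, and the MAX_RETRIES limit of 10 are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical upper bound; a retry count above this is treated as a
# likely mix-up with the timeout value.
MAX_RETRIES = 10


@dataclass(frozen=True)
class CallPolicy:
    retries: int     # extra attempts after the first call
    timeout_ms: int  # maximum wait per attempt, in milliseconds

    def __post_init__(self) -> None:
        if not 0 <= self.retries <= MAX_RETRIES:
            raise ValueError(f"retries must be between 0 and {MAX_RETRIES}")
        if self.timeout_ms <= 0:
            raise ValueError("timeout_ms must be a positive number of milliseconds")


def is_valid(retries: int, timeout_ms: int) -> bool:
    """Return True if the pair of values forms an acceptable policy."""
    try:
        CallPolicy(retries=retries, timeout_ms=timeout_ms)
    except ValueError:
        return False
    return True
```

Under these assumed limits, is_valid(3, 1000) accepts the values from option B, while is_valid(1000, 3) and is_valid(0, 0) reject the values from options A, C, and D.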
3. Consider this pseudocode snippet for a microservice call with resilience:
response = callService().retry(2).timeout(500).execute()
If the service fails twice quickly and then succeeds on the third try, what will be the outcome?
medium
A. The call succeeds after two retries within timeout
B. The call never retries and returns failure
C. The call times out before any retry
D. The call fails immediately without retries

Solution

  1. Step 1: Analyze retry behavior

    Retry(2) means the system will try up to 3 times total (1 initial + 2 retries) if failures occur.
  2. Step 2: Consider timeout and success timing

    Timeout(500) means each try waits up to 500ms. If the third try succeeds within this time, the call succeeds.
  3. Final Answer:

    The call succeeds after two retries within timeout -> Option A
  4. Quick Check:

    Retries allow success after failures = A [OK]
Hint: Retries add attempts; timeout limits each try duration [OK]
Common Mistakes:
  • Assuming no retries happen
  • Confusing total timeout with per-try timeout
  • Thinking timeout cancels retries immediately
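The scenario in this question can be simulated with a small fluent wrapper. This is a sketch under assumptions: ResilientCall and make_flaky_service are hypothetical names that only mimic the pseudocode's shape, the timeout is treated as a per-attempt budget, and latency is simulated rather than measured.

```python
from typing import Callable


class ResilientCall:
    """Hypothetical wrapper mirroring callService().retry(n).timeout(ms)."""

    def __init__(self, operation: Callable[[int], str]) -> None:
        self._operation = operation
        self._retries = 0
        self._timeout_ms = 1000
        self.attempts = 0

    def retry(self, count: int) -> "ResilientCall":
        if count < 0:
            raise ValueError("retry count must be >= 0")
        self._retries = count
        return self

    def timeout(self, ms: int) -> "ResilientCall":
        self._timeout_ms = ms
        return self

    def execute(self) -> str:
        # 1 initial attempt + self._retries retries.
        for attempt in range(1 + self._retries):
            self.attempts += 1
            try:
                return self._operation(self._timeout_ms)
            except Exception:
                if attempt == self._retries:
                    raise  # out of attempts: surface the last error
        raise RuntimeError("unreachable: the loop always returns or raises")


def make_flaky_service(failures_before_success: int, latency_ms: int):
    """Simulated service that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def service(timeout_ms: int) -> str:
        state["calls"] += 1
        if latency_ms > timeout_ms:
            raise TimeoutError("attempt exceeded its time budget")
        if state["calls"] <= failures_before_success:
            raise ConnectionError("transient failure")
        return "ok"

    return service


call = ResilientCall(make_flaky_service(2, latency_ms=50)).retry(2).timeout(500)
response = call.execute()
```

With two quick failures followed by a success, the third attempt returns "ok". With retry(1) the same service would exhaust its attempts and raise instead.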
4. A microservice uses a circuit breaker to prevent cascading failures. The circuit breaker is set to open after 5 failures but it opens after only 2 failures. What is the likely cause?
medium
A. The failure count threshold is incorrectly configured
B. The circuit breaker is ignoring failures
C. The service is not failing at all
D. The circuit breaker is disabled

Solution

  1. Step 1: Understand circuit breaker failure threshold

    The circuit breaker opens after a configured number of failures to stop calls temporarily.
  2. Step 2: Analyze early opening

    If it opens after 2 failures instead of 5, the threshold setting is likely wrong or misread.
  3. Final Answer:

    The failure count threshold is incorrectly configured -> Option A
  4. Quick Check:

    Early circuit breaker open = A [OK]
Hint: Check config values when behavior differs from expectations [OK]
Common Mistakes:
  • Assuming circuit breaker ignores failures
  • Thinking service is healthy when breaker opens
  • Believing circuit breaker is disabled if it opens
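To see why the configured threshold decides when the breaker opens, here is a minimal count-based circuit breaker in Python. It is a sketch, not a production implementation: CircuitBreaker, CircuitOpenError, failure_threshold, and reset_after_s are hypothetical names, and real libraries often use failure rates over a sliding window instead of a simple counter.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit is open; call rejected")
            # Half-open: let one trial call through; one more failure reopens.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # a success resets the consecutive-failure count
        return result


def failing_call() -> str:
    raise ConnectionError("downstream unavailable")


def failures_until_open(threshold: int) -> int:
    """Count how many failed calls it takes for the breaker to open."""
    breaker = CircuitBreaker(failure_threshold=threshold)
    count = 0
    while not breaker.is_open:
        try:
            breaker.call(failing_call)
        except ConnectionError:
            count += 1
    return count
```

In this sketch failures_until_open(5) needs five failures and failures_until_open(2) needs two, so a breaker that opens after two failures is most plausibly running with a threshold of 2 rather than the intended 5.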
5. You design a microservices system with multiple dependent services. To prevent cascading failures, which combination of resilience patterns is best to apply?
hard
A. No retries, no timeouts, and no bulkheads
B. Retries with long timeouts and no circuit breakers
C. Circuit breakers, bulkheads, and short timeouts
D. Retries with infinite timeout and no bulkheads

Solution

  1. Step 1: Identify resilience patterns that isolate failures

    Circuit breakers stop calls to failing services; bulkheads isolate failures to parts of the system; short timeouts prevent long waits.
  2. Step 2: Evaluate options for preventing cascading failures

    Option C combines circuit breakers, bulkheads, and short timeouts, which together keep the system stable and responsive.
  3. Final Answer:

    Circuit breakers, bulkheads, and short timeouts -> Option C
  4. Quick Check:

    Best resilience combo isolates and limits failure impact = C [OK]
Hint: Use circuit breakers + bulkheads + short timeouts to isolate failures [OK]
Common Mistakes:
  • Using long or infinite timeouts causing delays
  • Skipping circuit breakers leading to cascading failures
  • Not isolating failures with bulkheads
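Of the three patterns in the answer, the bulkhead is the least familiar, so here is a minimal Python sketch of one. It assumes a semaphore-based design that rejects callers immediately when all slots are busy; Bulkhead, BulkheadFullError, and service_b_pool are hypothetical names, and timeouts and circuit breaking are left out.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(RuntimeError):
    """Raised when every slot reserved for a dependency is already in use."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust all workers."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation: Callable[[], T]) -> T:
        # Fail fast instead of queueing: a blocked caller would tie up a
        # worker, which is the situation the bulkhead exists to prevent.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return operation()
        finally:
            self._slots.release()


# Hypothetical pool with a single slot, to make the rejection easy to see.
service_b_pool = Bulkhead(max_concurrent=1)


def slow_call() -> str:
    # While this call holds the only slot, a second caller is turned away.
    try:
        service_b_pool.call(lambda: "second")
    except BulkheadFullError:
        return "second caller rejected"
    return "second caller admitted"


outcome = service_b_pool.call(slow_call)
after = service_b_pool.call(lambda: "slot free again")
```

Giving each downstream dependency its own bulkhead means a slow dependency can fill only its own slots, leaving capacity for calls to healthy services.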