
Why Resilience Prevents Cascading Failures in Microservices

Problem Statement

When one microservice fails or slows down, the services that depend on it can fail or slow down too, setting off a chain reaction that can bring down the entire system. This cascading failure happens because callers keep waiting for responses or retry without limit, tying up threads, connections, and memory until they can no longer serve their own callers.
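The arithmetic behind this chain reaction can be sketched in a few lines. This is an illustrative calculation, not a measurement: the function names and the example numbers (3 attempts, 3 layers, 100 requests per second, a 30 second hang) are assumptions chosen for the example.

```python
def calls_reaching_deepest_service(attempts_per_layer: int, layers: int) -> int:
    """Worst-case calls the last service in a chain receives for one user
    request when each of `layers` callers makes `attempts_per_layer` attempts."""
    return attempts_per_layer ** layers


def in_flight_requests(arrival_rate_per_s: float, wait_s: float) -> float:
    """Requests held open at once when each one waits `wait_s` seconds
    (Little's law: concurrency = arrival rate x time in system)."""
    return arrival_rate_per_s * wait_s


# Three calling layers, each trying up to 3 times, multiply the load.
print(calls_reaching_deepest_service(3, 3))  # 27

# 100 requests/s that each hang for 30 s hold 3000 requests open at once.
print(in_flight_requests(100, 30.0))  # 3000.0
```

Unbounded retries multiply load on the service that is already struggling, and unbounded waits multiply the resources each caller holds open.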

Solution

Resilience techniques add safeguards like timeouts, retries with limits, and fallback responses to each service. These controls stop failures from spreading by quickly detecting problems and isolating them, so one service's trouble doesn't overwhelm others. This keeps the system stable and responsive even when parts fail.
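As a minimal sketch of "retries with limits" plus a fallback, assuming a hypothetical helper named `call_with_retries` (not part of any library) and leaving the per-call timeout to the wrapped operation:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    fallback: T,
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `operation` at most `max_attempts` times, then return `fallback`."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # assumption: real code should catch narrower errors
            if attempt == max_attempts - 1:
                break  # retry budget exhausted, stop calling
            # Exponential backoff with jitter: wait 0..base_delay * 2^attempt
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    return fallback
```

The cap on attempts bounds the extra load sent to a struggling service, and the randomized delay keeps many callers from retrying at the same instant.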

Architecture

[Diagram: Service A → Service B, guarded by Timeout & Retry Logic]

This diagram shows three microservices calling each other in sequence, each protected by resilience mechanisms like timeouts, circuit breakers, and fallback responses to prevent failure from spreading.

Trade-offs

✓ Pros

→ Prevents one service failure from causing system-wide outages.
→ Improves overall system availability and user experience.
→ Allows graceful degradation by providing fallback responses.
→ Detects failures quickly to reduce wasted resources on retries.

✗ Cons

→ Adds complexity to service code and configuration.
→ May mask underlying problems if fallbacks are overused.
→ Requires careful tuning of timeouts and retry policies.

Use when your system has multiple dependent microservices that call each other and you expect partial failures or network issues, typically at hundreds of requests per second or more.

Avoid if your system is a simple monolith or has very low traffic (under 100 requests per second) where the overhead of resilience mechanisms outweighs benefits.
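One way to reason about the tuning cost is to compute how long a caller can be blocked in the worst case. The sketch below assumes a retry policy with a fixed per-attempt timeout and exponential backoff between attempts; the function name and the example values are illustrative assumptions.

```python
def worst_case_wait(timeout_s: float, max_attempts: int, base_delay_s: float) -> float:
    """Upper bound on time spent in one logical call: every attempt times out,
    and each backoff before a retry takes its maximum value."""
    time_in_attempts = max_attempts * timeout_s
    # Backoff happens between attempts, so there are max_attempts - 1 waits.
    time_in_backoff = sum(base_delay_s * 2 ** i for i in range(max_attempts - 1))
    return time_in_attempts + time_in_backoff


# A 2 s timeout with 3 attempts and a 0.1 s base delay: roughly 6.3 s worst case.
budget = worst_case_wait(timeout_s=2.0, max_attempts=3, base_delay_s=0.1)
print(f"{budget:.1f} s")
```

If the caller's own timeout is shorter than this bound, the later retries are wasted work, which is one reason timeouts and retry counts need to be tuned together across services.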

Real World Examples

Netflix

Netflix uses circuit breakers and fallback responses in its microservices to prevent cascading failures during high traffic spikes or partial outages, ensuring continuous streaming.

Amazon

Amazon applies resilience patterns like retries with exponential backoff and circuit breakers to isolate failures in its order processing microservices, preventing system-wide order delays.

Uber

Uber implements timeouts and fallback logic in its ride matching services to avoid cascading failures when some services become slow or unresponsive during peak demand.

Code Example

The before code calls another service without limits, risking long waits or failures spreading. The after code adds a timeout to stop waiting too long and a circuit breaker to stop calling the failing service after repeated errors, returning a fallback response instead.


### Before resilience (naive call without timeout or circuit breaker)
import requests

def call_service_b():
    response = requests.get('http://service-b/api/data')
    return response.json()


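### After applying resilience with timeout and circuit breaker
import requests
from circuitbreaker import circuit, CircuitBreakerError

# The breaker counts exceptions raised by the decorated function, so this
# function must let errors propagate instead of catching them itself.
@circuit(failure_threshold=3, recovery_timeout=10)
def fetch_from_service_b():
    response = requests.get('http://service-b/api/data', timeout=2)
    response.raise_for_status()
    return response.json()

def call_service_b():
    try:
        return fetch_from_service_b()
    except (requests.exceptions.RequestException, CircuitBreakerError):
        # The request failed or the circuit is open: degrade gracefully
        return {'data': 'fallback response'}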
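
To show what a circuit breaker does internally, here is a minimal, dependency-free sketch. It is not the `circuitbreaker` library's implementation: the class `SimpleCircuitBreaker`, the `CircuitOpenError` exception, and the injectable `clock` parameter are hypothetical names introduced for illustration, and the sketch is not thread-safe.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], Any]) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                # Open: fail fast without touching the failing service
                raise CircuitOpenError("circuit open, call rejected")
            # Recovery timeout elapsed: half-open, let a trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # A success closes the circuit and resets the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error they can turn into a fallback, so no threads or connections are spent waiting on a service that is known to be failing.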


Alternatives

Bulkhead Isolation

Divides system resources into isolated pools to contain failures, rather than relying on timeouts and retries.

Use when: resource exhaustion is a major risk and you want to prevent one service from consuming all resources.
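
A bulkhead can be sketched with a semaphore that caps concurrent calls to one dependency. The `Bulkhead` class and `BulkheadFullError` exception below are hypothetical names for illustration; production libraries usually add queueing and metrics on top of this idea.

```python
import threading
from typing import Any, Callable


class BulkheadFullError(Exception):
    """Raised when the pool reserved for a dependency has no free slot."""


class Bulkhead:
    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func: Callable[[], Any]) -> Any:
        # Non-blocking acquire: reject immediately instead of queueing
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no capacity left for this dependency")
        try:
            return func()
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` means a slow dependency can occupy only its own slots, leaving capacity for calls to healthy ones.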

Load Shedding

Drops or rejects requests under high load to prevent overload, instead of trying to handle all requests with resilience.

Use when: protecting system stability under extreme load is more important than serving every request.
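
Load shedding can be sketched on the receiving side as a cap on requests in flight. The `LoadShedder` class, the limit, and the use of a 503 status with a plain message are illustrative assumptions, not a specific framework's API.

```python
import threading
from typing import Any, Callable, Tuple


class LoadShedder:
    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def handle(self, handler: Callable[[], Any]) -> Tuple[int, Any]:
        """Run `handler` if there is capacity, otherwise reject with 503."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return 503, "overloaded, try again later"
            self._in_flight += 1
        try:
            return 200, handler()
        finally:
            with self._lock:
                self._in_flight -= 1
```

Rejecting the excess quickly keeps latency predictable for the requests that are accepted, at the cost of failing some callers outright.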

Summary

Cascading failures happen when one service's problem spreads to others, causing system-wide outages.

Resilience techniques like timeouts, retries, circuit breakers, and fallbacks isolate failures and keep the system stable.

Applying resilience is essential in microservices to maintain availability and prevent widespread failures.

Practice

1. What is the main reason resilience techniques are used in microservices architectures? (easy)

A. To increase the speed of all services regardless of failures

B. To make services use less memory

C. To reduce the number of services in the system

D. To prevent one service failure from causing other services to fail
