Circuit breaker pattern in Microservices - System Design Guide

Problem Statement
When a microservice calls another service that is slow or down, the calling service waits too long or fails repeatedly, causing cascading failures and degraded user experience across the system.
Solution
The circuit breaker monitors calls to a service and stops requests when failures exceed a threshold. It quickly fails requests instead of waiting, then periodically tests if the service has recovered before resuming normal calls.
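The behaviour described above amounts to a small state machine. The following is a minimal sketch of the transitions, assuming the three conventional states (CLOSED, OPEN, HALF_OPEN). The `next_state` helper and its parameter names are hypothetical and exist only for illustration; they are not taken from any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls pass through normally
    OPEN = "open"            # calls are rejected immediately
    HALF_OPEN = "half_open"  # a trial call is allowed through


def next_state(state, *, call_succeeded=None, failures=0,
               threshold=3, cooldown_elapsed=False):
    """Return the breaker's next state (hypothetical helper for illustration).

    call_succeeded is only consulted in HALF_OPEN, where a trial call was made.
    """
    if state is State.CLOSED:
        # Trip once the failure count reaches the threshold.
        return State.OPEN if failures >= threshold else State.CLOSED
    if state is State.OPEN:
        # Stay open until the cooldown has passed, then probe.
        return State.HALF_OPEN if cooldown_elapsed else State.OPEN
    # HALF_OPEN: the outcome of the trial call decides.
    return State.CLOSED if call_succeeded else State.OPEN


print(next_state(State.CLOSED, failures=3).name)               # OPEN
print(next_state(State.OPEN, cooldown_elapsed=True).name)      # HALF_OPEN
print(next_state(State.HALF_OPEN, call_succeeded=False).name)  # OPEN
```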
Architecture
┌───────────────┐       ┌─────────────────┐       ┌───────────────┐
│ Client        │──────▶│ Circuit Breaker │──────▶│ Downstream    │
│ (Caller)      │       │ (Monitor &      │       │ Service       │
│               │       │  Control)       │       │               │
└───────────────┘       └─────────────────┘       └───────────────┘
       ▲                      │  ▲                       │
       │                      │  │                       │
       │                      │  └───── Failure count ───┘
       │                      │
       └───────── Success / Failure feedback ────────────┘

This diagram shows the client calling a downstream service through a circuit breaker that monitors success and failure to decide whether to allow or block requests.

Trade-offs
✓ Pros
Prevents cascading failures by stopping calls to failing services quickly.
Improves system stability and responsiveness under partial outages.
Allows automatic recovery by periodically testing service health.
✗ Cons
Adds complexity to service communication logic.
Requires tuning thresholds and timeouts to avoid false positives or negatives.
May cause temporary denial of service if the breaker trips incorrectly.
Use when your services depend on network calls that can fail or become slow, especially at higher request volumes (roughly hundreds of requests per second and up), where a stalled dependency quickly ties up callers and degrades user experience.
Avoid if your service calls are always local or otherwise reliable, or if traffic is low enough (for example, under 100 requests per second) that occasional failures have little impact.
Real World Examples
Netflix
Netflix built and open-sourced the Hystrix library to apply circuit breakers across its microservices, preventing a failure in one service from cascading into widespread outages during high-traffic events.
Amazon
Amazon applies circuit breakers to isolate failing downstream services during peak shopping times to maintain overall system responsiveness.
Uber
Uber uses circuit breakers to handle unreliable third-party APIs and internal services, ensuring degraded but stable user experience.
Code Example
The before code calls the external service directly, risking cascading failures. The after code wraps the call in a CircuitBreaker class that tracks failures and blocks calls when failures exceed a threshold, allowing recovery testing after a timeout.
### Before: No circuit breaker, direct call
class ServiceClient:
    def call_service(self):
        response = external_service_request()
        return response


### After: Circuit breaker applied
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_time=10):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_time = recovery_time          # seconds to stay OPEN before a trial call
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            # time.monotonic() is unaffected by system clock adjustments
            if time.monotonic() - self.last_failure_time > self.recovery_time:
                self.state = 'HALF_OPEN'  # allow one trial call through
            else:
                raise CircuitOpenError('Circuit breaker is OPEN')

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed trial call reopens the circuit immediately
            if self.state == 'HALF_OPEN' or self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
            raise
        else:
            self.failure_count = 0
            if self.state == 'HALF_OPEN':
                self.state = 'CLOSED'
            return result


class ServiceClientWithCB:
    def __init__(self):
        self.circuit_breaker = CircuitBreaker()

    def call_service(self):
        return self.circuit_breaker.call(external_service_request)


# external_service_request is a placeholder for the actual call
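To watch a breaker trip and recover without waiting for real time to pass, the sketch below repeats a compact breaker with an injectable clock and wraps it in a fallback. The `flaky` function, the `service` flag, the `fake_now` clock, and the "cached data" fallback value are all hypothetical stand-ins chosen for this illustration.

```python
import time


class CircuitBreaker:
    """Compact breaker; `clock` is injectable so time can be controlled."""

    def __init__(self, failure_threshold=3, recovery_time=10, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_time = recovery_time
        self.clock = clock
        self.failure_count = 0
        self.opened_at = None
        self.state = "CLOSED"

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at > self.recovery_time:
                self.state = "HALF_OPEN"  # let one trial call through
            else:
                raise RuntimeError("Circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        self.failure_count = 0
        self.state = "CLOSED"
        return result


# Hypothetical downstream call that fails while the service is unhealthy.
service = {"healthy": False}


def flaky():
    if not service["healthy"]:
        raise ConnectionError("downstream unavailable")
    return "fresh data"


fake_now = [0.0]  # fake clock, advanced manually
breaker = CircuitBreaker(clock=lambda: fake_now[0])


def fetch_with_fallback():
    """Return live data, or a cached placeholder if the call fails or is blocked."""
    try:
        return breaker.call(flaky)
    except Exception:
        return "cached data"


results = [fetch_with_fallback() for _ in range(4)]
state_after_failures = breaker.state  # three failures trip the breaker

service["healthy"] = True
fake_now[0] += 11                     # move past recovery_time
recovered = fetch_with_fallback()     # trial call succeeds, breaker closes
```

The fourth call never reaches `flaky`: the breaker rejects it immediately and the caller serves the fallback, which is the "degraded but stable" behaviour the pattern aims for.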
Alternatives
Retry pattern
Retries a failed request a fixed number of times before giving up, without ever cutting off calls to the service.
Use when: failures are transient and a quick retry is likely to succeed without risking cascading failures.
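For comparison with the breaker, a retry with a fixed attempt count might look like the sketch below. The `retry` helper, its default values, and the `succeeds_on_third_try` function are hypothetical; the `sleep` parameter is injectable only so the example runs without real delays.

```python
import time


def retry(func, attempts=3, delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times, pausing `delay` seconds between tries.

    Hypothetical helper for illustration; `sleep` can be replaced in tests.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)


calls = {"count": 0}


def succeeds_on_third_try():
    # Simulates a transient fault that clears after two failures.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("transient failure")
    return "ok"


result = retry(succeeds_on_third_try, sleep=lambda _: None)
```

Unlike a circuit breaker, every attempt here still reaches the downstream service, so retries alone can add load to a service that is already struggling.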
Bulkhead pattern
Isolates failures by partitioning resources so that one failing dependency cannot exhaust capacity needed by others, rather than blocking calls outright.
Use when: you want to limit the impact of a failure through resource isolation instead of blocking requests.
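One common way to partition resources is to cap the number of concurrent calls to each dependency. The sketch below does this with a semaphore; the `Bulkhead` class, the limit of 2, and the nested calls used to occupy both slots are hypothetical choices made for illustration.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency (hypothetical helper)."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately rather than queueing when all slots are taken.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("Bulkhead full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


payments = Bulkhead(max_concurrent=2)


def third_call():
    # Attempted while two slots are already held, so it is rejected.
    try:
        return payments.call(lambda: "third call ran")
    except RuntimeError:
        return "third call rejected"


def second_call():
    # Runs while the outer call holds one slot; this takes the other.
    return payments.call(third_call)


outcome = payments.call(second_call)
after = payments.call(lambda: "slot free again")
```

Nesting is used here only to hold two slots at once in a single thread; in a real service the slots would be held by concurrent requests.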
Summary
The circuit breaker pattern prevents cascading failures by stopping calls to a failing service after repeated errors.
It improves system stability by quickly failing requests and testing service recovery before resuming calls.
This pattern is essential in microservices architectures with unreliable network calls at scale.

Practice

1. What is the primary purpose of the circuit breaker pattern in microservices?
easy
A. To prevent repeated calls to a failing service and improve system stability
B. To increase the speed of database queries
C. To encrypt communication between services
D. To balance load evenly across servers

Solution

  1. Step 1: Understand the problem circuit breaker solves

    The circuit breaker pattern stops calls to a failing service to avoid cascading failures.
  2. Step 2: Identify the main benefit

    This pattern improves system stability by preventing repeated failures and allowing recovery.
  3. Final Answer:

    To prevent repeated calls to a failing service and improve system stability -> Option A
  4. Quick Check:

    Circuit breaker purpose = prevent repeated failing calls [OK]
Hint: Circuit breaker stops calls to failing services fast [OK]
Common Mistakes:
  • Confusing circuit breaker with load balancing
  • Thinking it speeds up database queries
  • Assuming it encrypts data
2. Which of the following correctly represents the three states of a circuit breaker?
easy
A. START, STOP, PAUSE
B. ACTIVE, INACTIVE, PENDING
C. CLOSED, OPEN, HALF_OPEN
D. ON, OFF, WAIT

Solution

  1. Step 1: Recall circuit breaker states

    The circuit breaker has three states: CLOSED (normal), OPEN (blocking calls), HALF_OPEN (testing recovery).
  2. Step 2: Match states to options

    Only CLOSED, OPEN, HALF_OPEN lists these exact states.
  3. Final Answer:

    CLOSED, OPEN, HALF_OPEN -> Option C
  4. Quick Check:

    States = CLOSED, OPEN, HALF_OPEN [OK]
Hint: Remember states as Closed, Open, Half-Open [OK]
Common Mistakes:
  • Mixing up state names with unrelated terms
  • Using generic terms like ON/OFF
  • Forgetting the HALF_OPEN state
3. Consider this pseudocode for a circuit breaker:
if state == 'OPEN':
  return 'fail fast'
elif state == 'HALF_OPEN':
  if test_call_successful():
    state = 'CLOSED'
  else:
    state = 'OPEN'
else:
  call_service()
What happens when the circuit breaker is in HALF_OPEN state and the test call fails?
medium
A. The state changes to CLOSED and service calls continue
B. The state remains HALF_OPEN and retries immediately
C. The service call is ignored without state change
D. The state changes back to OPEN and calls are blocked

Solution

  1. Step 1: Analyze HALF_OPEN state logic

    In HALF_OPEN, a test call checks if the service recovered. If it fails, the state changes to OPEN.
  2. Step 2: Understand consequence of failure

    Changing to OPEN blocks further calls to prevent overload.
  3. Final Answer:

    The state changes back to OPEN and calls are blocked -> Option D
  4. Quick Check:

    HALF_OPEN fail -> OPEN state [OK]
Hint: Failed test call in HALF_OPEN resets to OPEN [OK]
Common Mistakes:
  • Assuming state changes to CLOSED on failure
  • Thinking retries happen immediately in HALF_OPEN
  • Ignoring state changes on test failure
4. A developer implemented a circuit breaker but notices it never transitions from OPEN to HALF_OPEN. What is the most likely cause?
medium
A. The timeout to switch from OPEN to HALF_OPEN is missing or too long
B. The service calls are always successful
C. The circuit breaker is stuck in CLOSED state
D. The test call in HALF_OPEN always succeeds

Solution

  1. Step 1: Understand OPEN to HALF_OPEN transition

    The circuit breaker moves from OPEN to HALF_OPEN after a timeout period to test recovery.
  2. Step 2: Identify cause of no transition

    If the timeout is missing or set too long, the breaker stays OPEN indefinitely.
  3. Final Answer:

    The timeout to switch from OPEN to HALF_OPEN is missing or too long -> Option A
  4. Quick Check:

    Missing timeout blocks OPEN -> HALF_OPEN transition [OK]
Hint: Check timeout settings for OPEN to HALF_OPEN switch [OK]
Common Mistakes:
  • Assuming success of service calls affects OPEN state
  • Confusing CLOSED and OPEN states
  • Ignoring timeout mechanism
5. You design a microservice system with a circuit breaker protecting a payment service. The circuit breaker trips (opens) after 5 failures within 1 minute and stays open for 2 minutes before trying again. What is the main trade-off of setting the open duration too long?
hard
A. Long open duration improves user experience by retrying quickly
B. Long open duration reduces load on failing service but increases request failures for users
C. Long open duration causes the circuit breaker to never open
D. Long open duration increases the number of successful calls

Solution

  1. Step 1: Understand open duration effect

    A long open duration blocks calls longer, reducing load on the failing service.
  2. Step 2: Identify user impact

    While protecting the service, users experience more failures because calls are blocked longer.
  3. Final Answer:

    Long open duration reduces load on failing service but increases request failures for users -> Option B
  4. Quick Check:

    Long open = less load, more user failures [OK]
Hint: Long open = safer service, worse user experience [OK]
Common Mistakes:
  • Thinking long open improves user experience
  • Assuming circuit breaker never opens with long duration
  • Believing long open increases successful calls
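The configuration in question 5 (5 failures within 1 minute, open for 2 minutes) can be sketched with a sliding window of failure timestamps. The `WindowedBreaker` class, its method names, and the injectable `clock` are assumptions made for this illustration, not a reference implementation.

```python
import time
from collections import deque


class WindowedBreaker:
    """Opens after `max_failures` failures within `window` seconds (sketch)."""

    def __init__(self, max_failures=5, window=60.0, open_duration=120.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window = window
        self.open_duration = open_duration
        self.clock = clock
        self.failures = deque()  # timestamps of recent failures
        self.opened_at = None    # None while the breaker is closed

    def allow_request(self):
        """True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        # After open_duration has passed, let a probe call through.
        return self.clock() - self.opened_at >= self.open_duration

    def record_failure(self):
        now = self.clock()
        self.failures.append(now)
        # Drop failures that have aged out of the window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        # A failed probe, or too many recent failures, (re)opens the breaker.
        if self.opened_at is not None or len(self.failures) >= self.max_failures:
            self.opened_at = now

    def record_success(self):
        self.failures.clear()
        self.opened_at = None


now = [0.0]  # fake clock, advanced manually
breaker = WindowedBreaker(clock=lambda: now[0])
for _ in range(5):
    now[0] += 10  # five failures, 10 seconds apart
    breaker.record_failure()

blocked_immediately = not breaker.allow_request()
now[0] += 119
still_blocked = not breaker.allow_request()
now[0] += 1
probe_allowed = breaker.allow_request()
```

Every request arriving during those 120 seconds is rejected, even if the payment service recovers after 10, which is the user-facing cost of a long open duration.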