Circuit breaker pattern in HLD - System Design Exercise

Design: Circuit Breaker Pattern Implementation

Design the circuit breaker pattern as a reusable component integrated with service calls. Out of scope: detailed implementation of external services or fallback logic.

Functional Requirements

FR1: Detect failures in calls to external services or components

FR2: Prevent repeated calls to failing services to avoid cascading failures

FR3: Automatically allow trial calls to the service after a cooldown period, and resume normal traffic if they succeed

FR4: Provide fallback responses when the external service is unavailable

FR5: Support monitoring of circuit breaker state and metrics

Non-Functional Requirements

NFR1: Handle up to 10,000 requests per second

NFR2: Fail fast with p99 latency under 100ms for service calls

NFR3: Ensure availability of 99.9% for the main application

NFR4: Minimal added latency when circuit breaker is closed (normal operation)

Think Before You Design

Key Components

Circuit breaker state machine (Closed, Open, Half-Open)

Failure detection and counting mechanism

Timeout and retry scheduler

Fallback handler

Metrics and monitoring system
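The failure detection component has to treat slow responses as failures, not only errors. A minimal sketch of one way to do that with Python's standard library follows; the helper name `call_with_timeout` and the pool size are assumptions made for illustration, not part of the exercise.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Tuple

# Hypothetical shared worker pool; the size is an arbitrary choice for the sketch.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(operation: Callable[[], Any],
                      timeout_seconds: float) -> Tuple[bool, Any]:
    """Run operation and report (succeeded, result).

    A call that raises or that takes longer than timeout_seconds
    is reported as a failure so the breaker can count it.
    """
    future = _pool.submit(operation)
    try:
        return True, future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # Best effort only: a thread that is already running is not interrupted.
        future.cancel()
        return False, None
    except Exception:
        # Any error raised by the operation counts as a failed call.
        return False, None
```

The caller would feed the boolean into whatever failure counter the breaker uses. Note that the timed-out work keeps running in its thread, so a real deployment would also want timeouts on the underlying network client.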

Design Patterns

State machine pattern for managing circuit states

Timeout and retry pattern

Bulkhead pattern to isolate failures

Fallback pattern for degraded responses

Observer pattern for monitoring state changes
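Two of the patterns listed above, the state machine and the observer, combine naturally. The sketch below is one possible shape, assuming the three states from this exercise and a hypothetical `BreakerStateMachine` class; the transition table reflects the usual Closed, Open, Half-Open cycle.

```python
from enum import Enum
from typing import Callable, List


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# Legal transitions between the three states.
ALLOWED = {
    State.CLOSED: {State.OPEN},
    State.OPEN: {State.HALF_OPEN},
    State.HALF_OPEN: {State.CLOSED, State.OPEN},
}

# A listener receives (old_state, new_state) on every transition.
Listener = Callable[[State, State], None]


class BreakerStateMachine:
    """Holds the current state and notifies observers when it changes."""

    def __init__(self) -> None:
        self.state = State.CLOSED
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def transition(self, new_state: State) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        old_state, self.state = self.state, new_state
        for listener in self._listeners:
            listener(old_state, new_state)
```

A metrics exporter can then subscribe once and record every state change without the breaker knowing anything about the monitoring system.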

Reference Architecture

Client Service
   |
   |---> Circuit Breaker Component ---+---> External Service
                                      |
                                      +---> Fallback Handler
                                      |
                                      +---> Metrics & Monitoring

Components

Circuit Breaker Component

Technology: In-memory state machine or distributed cache (e.g., Redis)

Responsibility: Track failure counts, manage states (Closed, Open, Half-Open), and decide if calls should proceed

External Service

Technology: Any third-party or internal service

Responsibility: Service being called, which may fail or respond slowly

Fallback Handler

Technology: Custom code or default response generator

Responsibility: Provide an alternative response when the circuit is open

Metrics & Monitoring

Technology: Prometheus, Grafana, or similar

Responsibility: Collect and visualize circuit breaker state, failure rates, and latency

Request Flow

1. Client Service sends request to Circuit Breaker Component

2. Circuit Breaker checks current state:

   - If Closed: forwards request to External Service

   - If Open: immediately returns fallback response

   - If Half-Open: allows limited requests to test service health

3. External Service responds or fails

4. Circuit Breaker updates failure/success counters based on response

5. If failures exceed threshold, Circuit Breaker transitions to Open state

6. After cooldown period, Circuit Breaker moves to Half-Open to test service

7. Metrics & Monitoring collects state changes and performance data
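The request flow can be sketched as a small single-threaded class. This is an illustration under stated assumptions rather than a production implementation: the class name, parameter names, and defaults are hypothetical, the circuit opens when consecutive failures reach the threshold, Half-Open admits one probe at a time because there is no concurrency, and the clock is injectable so the cooldown can be tested.

```python
import time
from typing import Any, Callable


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3,
                 cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.state = "closed"
        self.failure_count = 0
        self.opened_at = 0.0

    def call(self, operation: Callable[[], Any],
             fallback: Callable[[], Any]) -> Any:
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Open: fail fast without touching the external service.
                return fallback()
            # Cooldown elapsed: let this call through as a probe.
            self.state = "half_open"
        try:
            result = operation()
        except Exception:
            self._on_failure()
            return fallback()
        self._on_success()
        return result

    def _on_failure(self) -> None:
        self.failure_count += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self) -> None:
        # Any success closes the circuit and resets the consecutive count.
        self.failure_count = 0
        self.state = "closed"
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, lambda: cached_profile)`, where both callables are placeholders. A multi-threaded service would additionally need a lock around the state updates and a cap on concurrent Half-Open probes.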

Database Schema

No persistent database required for core circuit breaker; uses in-memory or distributed cache to store:

- CircuitBreakerState { service_id, state (Closed/Open/Half-Open), failure_count, last_failure_time, last_state_change_time }

- Configuration { failure_threshold, timeout_duration, retry_interval }

Relationships: One-to-one mapping between service_id and CircuitBreakerState
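The two records above map directly onto plain data classes. The field names come from the schema; the types, the use of seconds as the time unit, and the `InMemoryStore` helper are assumptions added for the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    CLOSED = "Closed"
    OPEN = "Open"
    HALF_OPEN = "Half-Open"


@dataclass(frozen=True)
class Configuration:
    failure_threshold: int
    timeout_duration: float  # assumed to be seconds
    retry_interval: float    # assumed to be seconds


@dataclass
class CircuitBreakerState:
    service_id: str
    state: State = State.CLOSED
    failure_count: int = 0
    last_failure_time: Optional[float] = None
    last_state_change_time: Optional[float] = None


class InMemoryStore:
    """Hypothetical store enforcing one state record per service_id."""

    def __init__(self) -> None:
        self._records: Dict[str, CircuitBreakerState] = {}

    def get(self, service_id: str) -> CircuitBreakerState:
        # Create the record lazily on first access.
        if service_id not in self._records:
            self._records[service_id] = CircuitBreakerState(service_id)
        return self._records[service_id]
```

Swapping `InMemoryStore` for a client of a shared cache would keep the same record shape while changing only where the records live.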

Scaling Discussion

Bottlenecks

Circuit breaker state held separately in each application instance, so instances disagree on whether a service is healthy and behave inconsistently in distributed systems

High latency added by synchronous state checks

Memory overhead if many services or endpoints use circuit breakers

Delayed detection of failures due to slow failure count updates

Solutions

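Use a distributed cache (e.g., Redis) or shared state store to synchronize circuit breaker state across instances, while reading from a local in-memory copy on the request path so the shared store does not add latency to every call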

Implement asynchronous state updates and non-blocking calls to minimize added latency

Apply circuit breaker only to critical or high-risk services to reduce memory usage

Tune failure detection thresholds and use sliding windows for faster failure detection
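The sliding window mentioned in the last solution can be sketched as a time-based window over recent call outcomes. The class name, the 10-second window, and the minimum-call guard are illustrative assumptions; the guard keeps a handful of early failures from producing a misleading 100% failure rate.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class SlidingWindowFailureRate:
    def __init__(self, window_seconds: float = 10.0, min_calls: int = 5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self.min_calls = min_calls
        self.clock = clock
        # Each entry is (timestamp, failed), oldest first.
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, failed: bool) -> None:
        self._events.append((self.clock(), failed))
        self._evict()

    def _evict(self) -> None:
        # Drop outcomes that have aged out of the window.
        cutoff = self.clock() - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def failure_rate(self) -> float:
        self._evict()
        if len(self._events) < self.min_calls:
            # Too few samples to judge; report healthy.
            return 0.0
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)
```

A breaker would open when `failure_rate()` crosses a configured ratio, which reacts to a burst of failures faster than a cumulative counter and forgets old failures on its own.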

Interview Tips

Time (45 minutes total): Spend 10 minutes understanding requirements and clarifying failure scenarios, 20 minutes designing the architecture and data flow, 10 minutes discussing scaling and trade-offs, and 5 minutes summarizing.

Explain the purpose of the circuit breaker pattern to prevent cascading failures

Describe the three states and transitions clearly

Discuss how fallback responses improve user experience during failures

Highlight monitoring importance for operational visibility

Address scaling challenges in distributed environments and solutions