
Chaos engineering basics in Microservices - System Design Exercise


Design: Chaos Engineering Platform for Microservices

Design the chaos engineering platform components and integration with microservices. Out of scope: detailed microservice implementation or business logic.

Functional Requirements

FR1: Inject failures like latency, errors, and service crashes into microservices

FR2: Monitor system behavior and detect resilience issues automatically

FR3: Support scheduling and targeting specific microservices or endpoints

FR4: Provide dashboards for real-time metrics and failure impact visualization

FR5: Allow safe rollback and stop of experiments to avoid production damage

Non-Functional Requirements

NFR1: Handle up to 100 microservices in production

NFR2: Minimal impact on normal system latency (less than 200 ms added at p99)

NFR3: Availability target 99.9% uptime for the chaos platform itself

NFR4: Experiments must be isolated and reversible

NFR5: Secure access control for who can run chaos experiments


Key Components

Failure injection agents or sidecars in microservices

Central chaos control service with API and UI

Metrics collection and monitoring system

Experiment scheduler and rollback mechanism

Authentication and authorization for experiment control

Design Patterns

Circuit breaker pattern for resilience

Bulkhead isolation to limit failure blast radius

Canary deployments for safe testing

Event-driven monitoring and alerting

Feature flags to enable/disable chaos experiments
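As an illustration of the first pattern in this list, here is a minimal circuit breaker sketch. The class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are assumptions made for this example, not part of the exercise; production libraries typically add thread safety and richer half-open handling.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; retries after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: go half-open and let one trial call through.
            # A single failure from here reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

A chaos experiment that injects errors into a dependency is one way to check that a breaker like this actually opens and that callers degrade gracefully instead of piling up retries.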

Reference Architecture

                    +-----------------------+
                    | Chaos Control Service |
                    |  - API & UI           |
                    |  - Scheduler          |
                    |  - Auth & Rollback    |
                    +-----------+-----------+
                                |
                                v
       +------------------------+------------------------+
       | Failure Injection Agents (sidecars) in each    |
       | microservice instance                           |
       +------------------------+------------------------+
                                |
                                v
                    +-----------------------+
                    | Monitoring & Metrics  |
                    |  - Collect logs       |
                    |  - Track latency      |
                    |  - Alerting system    |
                    +-----------------------+

Components

Chaos Control Service

REST API in Node.js or Go, with a React UI

Central interface to create, schedule, monitor, and roll back chaos experiments

Failure Injection Agents

Sidecar containers or middleware in microservices

Inject faults like latency, errors, or crashes into targeted microservices

Monitoring & Metrics System

Prometheus + Grafana + Alertmanager

Collect metrics and logs to observe system behavior during chaos experiments

Authentication & Authorization

OAuth2 / RBAC

Control who can run or stop chaos experiments to ensure security

Experiment Scheduler

Cron jobs or Kubernetes CronJobs

Automate running chaos experiments at defined times or intervals
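To make the Failure Injection Agents component more concrete, the following is a minimal sketch of fault injection as middleware around a request handler. `FaultConfig`, `InjectedFault`, and `with_faults` are hypothetical names chosen for this example; a real sidecar would intercept network traffic and receive its configuration from the control service rather than share an in-process object.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class FaultConfig:
    latency_s: float = 0.0    # extra delay added before the handler runs
    error_rate: float = 0.0   # fraction of sampled requests that fail
    sample_rate: float = 1.0  # fraction of requests the fault applies to
    enabled: bool = False     # kill switch: False restores normal behavior


class InjectedFault(Exception):
    """Raised instead of calling the real handler when an error is injected."""


def with_faults(handler, config, rng=random.random, sleep=time.sleep):
    """Wrap `handler` so it obeys `config` on every call."""
    def wrapped(request):
        if config.enabled and rng() < config.sample_rate:
            if config.latency_s > 0:
                sleep(config.latency_s)
            if rng() < config.error_rate:
                raise InjectedFault("injected by chaos experiment")
        return handler(request)
    return wrapped
```

Because the wrapper reads `config` on each request, flipping `enabled` to `False` reverts the fault immediately, which is the property the rollback requirement depends on. The `sample_rate` field is one way to limit how many requests an experiment touches.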

Request Flow

1. User logs into Chaos Control Service UI with secure credentials.

2. User creates a chaos experiment specifying target microservices and failure types.

3. Scheduler triggers the experiment at the scheduled time.

4. Chaos Control Service sends commands to Failure Injection Agents in targeted microservices.

5. Agents inject faults (e.g., add latency, return errors) as instructed.

6. Monitoring system collects metrics and logs from microservices during the experiment.

7. Chaos Control Service displays real-time impact on dashboard.

8. User can stop or roll back the experiment via the control service.

9. Agents revert injected faults to restore normal service behavior.
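Steps 3 through 9 of this flow can be sketched as a small experiment lifecycle. Everything here is an assumption for illustration: `FakeAgent` stands in for a sidecar that would really be reached over the network, and the three-state `Status` enum is a simplification of whatever states a real platform tracks.

```python
from enum import Enum


class Status(Enum):
    SCHEDULED = "scheduled"
    RUNNING = "running"
    STOPPED = "stopped"


class FakeAgent:
    """In-memory stand-in for a failure injection agent."""

    def __init__(self, service):
        self.service = service
        self.active_fault = None

    def inject(self, fault):
        self.active_fault = fault

    def revert(self):
        self.active_fault = None


class Experiment:
    def __init__(self, name, fault, agents):
        self.name = name
        self.fault = fault
        self.agents = agents
        self.status = Status.SCHEDULED

    def start(self):
        # Only a scheduled experiment may start; a stopped one is not reused.
        if self.status is not Status.SCHEDULED:
            raise ValueError(f"cannot start from {self.status.value}")
        for agent in self.agents:
            agent.inject(self.fault)
        self.status = Status.RUNNING

    def stop(self):
        # Idempotent: reverting twice is harmless, so stop is always safe to call.
        for agent in self.agents:
            agent.revert()
        self.status = Status.STOPPED
```

Keeping `stop` idempotent means both a user click and an automated trigger can call it without coordinating with each other.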

Database Schema

Entities:

- User (id, username, password_hash, role)

- Experiment (id, name, description, target_services, failure_types, schedule, status, created_at, updated_at)

- Service (id, name, version, metadata)

- ExperimentLog (id, experiment_id, timestamp, metric_name, metric_value)

Relationships:

- User creates many Experiments (1:N)

- Experiment targets many Services (N:N)

- Experiment has many ExperimentLogs (1:N)
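One possible relational mapping of these entities, sketched with SQLite through Python's standard `sqlite3` module. Two details are assumptions that the entity list does not spell out: the N:N relationship is modeled as an `experiment_target` join table in place of the `target_services` field, and a `created_by` column carries the User-to-Experiment relationship. Column types are illustrative.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,
    password_hash TEXT NOT NULL,
    role TEXT NOT NULL
);
CREATE TABLE service (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT,
    metadata TEXT
);
CREATE TABLE experiment (
    id INTEGER PRIMARY KEY,
    created_by INTEGER NOT NULL REFERENCES user(id),  -- assumed column (1:N)
    name TEXT NOT NULL,
    description TEXT,
    failure_types TEXT,
    schedule TEXT,
    status TEXT NOT NULL DEFAULT 'scheduled',
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Assumed join table for the N:N Experiment-to-Service relationship.
CREATE TABLE experiment_target (
    experiment_id INTEGER NOT NULL REFERENCES experiment(id),
    service_id INTEGER NOT NULL REFERENCES service(id),
    PRIMARY KEY (experiment_id, service_id)
);
CREATE TABLE experiment_log (
    id INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiment(id),
    timestamp TEXT NOT NULL,
    metric_name TEXT NOT NULL,
    metric_value REAL NOT NULL
);
"""


def create_db(path=":memory:"):
    """Create the schema in a new SQLite database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

In practice, high-volume metric samples would more likely live in the monitoring backend, with `experiment_log` holding only summary values tied to an experiment.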

Scaling Discussion

Bottlenecks

Chaos Control Service becomes overloaded with many concurrent experiments

Failure Injection Agents add latency overhead to microservices

Monitoring system overwhelmed by high volume of metrics during experiments

Security risks if unauthorized users access chaos controls

Rollback delays cause prolonged system instability

Solutions

Scale Chaos Control Service horizontally with load balancing and stateless design

Optimize agents to minimize latency; use sampling or throttling of fault injection

Use scalable monitoring backends like Thanos or Cortex for Prometheus metrics

Implement strict RBAC and audit logging for experiment control access

Automate rollback triggers based on anomaly detection to reduce manual delays
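The last solution, automated rollback, can be sketched with a rolling error-rate check. This uses a plain static threshold as a stand-in for real anomaly detection, and `RollbackGuard` with its parameters is a hypothetical name for this example. The `on_breach` callback is where a platform would call its experiment stop routine.

```python
from collections import deque


class RollbackGuard:
    """Calls `on_breach` once when the rolling error rate exceeds `threshold`."""

    def __init__(self, on_breach, threshold=0.05, window=100, min_samples=20):
        self.on_breach = on_breach
        self.threshold = threshold
        self.min_samples = min_samples
        self.samples = deque(maxlen=window)  # True means the request failed
        self.tripped = False

    def record(self, failed):
        self.samples.append(bool(failed))
        # Wait for enough data, and never fire more than once.
        if self.tripped or len(self.samples) < self.min_samples:
            return
        error_rate = sum(self.samples) / len(self.samples)
        if error_rate > self.threshold:
            self.tripped = True
            self.on_breach(error_rate)
```

The `min_samples` floor avoids tripping on the first one or two failed requests, and firing only once keeps a burst of failures from issuing repeated rollback commands.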

Interview Tips

Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing components and data flow, 10 minutes discussing scaling and security, 5 minutes summarizing.

Explain the importance of safe, reversible chaos experiments in production

Describe how failure injection agents work and integrate with microservices

Highlight monitoring and metrics collection for impact analysis

Discuss security controls to prevent misuse

Address scaling challenges and solutions for large microservice environments

Practice


1. What is the main goal of chaos engineering in microservices?

easy

A. To reduce the number of developers needed

B. To increase the number of microservices in a system

C. To find and fix weaknesses before real failures occur

D. To speed up the deployment process

