Chaos engineering basics in Microservices - System Design Exercise

Design: Chaos Engineering Platform for Microservices
Design the chaos engineering platform components and integration with microservices. Out of scope: detailed microservice implementation or business logic.
Functional Requirements
FR1: Inject failures like latency, errors, and service crashes into microservices
FR2: Monitor system behavior and detect resilience issues automatically
FR3: Support scheduling and targeting specific microservices or endpoints
FR4: Provide dashboards for real-time metrics and failure impact visualization
FR5: Allow experiments to be stopped and rolled back safely to avoid production damage
Non-Functional Requirements
NFR1: Handle up to 100 microservices in production
NFR2: Minimal impact on normal request latency (less than 200 ms of added overhead at p99)
NFR3: 99.9% availability target for the chaos platform itself
NFR4: Experiments must be isolated and reversible
NFR5: Secure access control for who can run chaos experiments
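One way to check NFR2 is to compare latency samples taken with and without the injection agent attached. The sketch below assumes that "overhead" means the difference between the two p99 values, and it uses the nearest-rank percentile method. The function names and the 200 ms default budget parameter are illustrative, not part of the exercise.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p99_overhead_ms(baseline_ms, with_agent_ms):
    """Assumed definition: p99 with the agent minus p99 without it."""
    return percentile(with_agent_ms, 99) - percentile(baseline_ms, 99)


def meets_nfr2(baseline_ms, with_agent_ms, budget_ms=200.0):
    """True when the added p99 latency stays under the budget."""
    return p99_overhead_ms(baseline_ms, with_agent_ms) < budget_ms
```

In practice both sample sets would come from the monitoring system and should be collected under comparable load.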
Think Before You Design
Key Components
Failure injection agents or sidecars in microservices
Central chaos control service with API and UI
Metrics collection and monitoring system
Experiment scheduler and rollback mechanism
Authentication and authorization for experiment control
Design Patterns
Circuit breaker pattern for resilience
Bulkhead isolation to limit failure blast radius
Canary deployments for safe testing
Event-driven monitoring and alerting
Feature flags to enable/disable chaos experiments
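To make the circuit breaker pattern concrete, here is a minimal sketch of the kind of resilience mechanism a chaos experiment is meant to exercise. The class name, thresholds, and the injectable `clock` parameter are assumptions made for illustration. A production breaker would typically also limit how many trial calls run while half-open.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures, then allows a trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state() == "open":
            raise RuntimeError("circuit open: call rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

An experiment that injects errors into a downstream service can then verify that callers wrapped in such a breaker start failing fast instead of piling up requests.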
Reference Architecture
                    +-----------------------+
                    | Chaos Control Service |
                    |  - API & UI           |
                    |  - Scheduler          |
                    |  - Auth & Rollback    |
                    +-----------+-----------+
                                |
                                v
       +------------------------+------------------------+
       | Failure Injection Agents (sidecars) in each     |
       | microservice instance                           |
       +------------------------+------------------------+
                                |
                                v
                    +-----------------------+
                    | Monitoring & Metrics  |
                    |  - Collect logs       |
                    |  - Track latency      |
                    |  - Alerting system    |
                    +-----------------------+
Components
Chaos Control Service
Node.js/Go REST API with React UI
Central interface to create, schedule, monitor, and roll back chaos experiments
Failure Injection Agents
Sidecar containers or middleware in microservices
Inject faults like latency, errors, or crashes into targeted microservices
Monitoring & Metrics System
Prometheus + Grafana + Alertmanager
Collect metrics and logs to observe system behavior during chaos experiments
Authentication & Authorization
OAuth2 / RBAC
Control who can run or stop chaos experiments to ensure security
Experiment Scheduler
Cron jobs or Kubernetes CronJobs
Automate running chaos experiments at defined times or intervals
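The failure injection agent described above can be sketched as middleware that wraps a request handler. This is a simplified in-process illustration, not a sidecar implementation. `FaultConfig`, `FaultInjector`, and `InjectedFault` are hypothetical names, and the configuration is assumed to be pushed by the Chaos Control Service.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class FaultConfig:
    """Hypothetical fault settings pushed by the control service."""
    enabled: bool = False
    latency_ms: int = 0       # extra delay added before the handler runs
    error_rate: float = 0.0   # fraction of requests answered with an injected error


class InjectedFault(Exception):
    """Raised in place of the real response when an error is injected."""


class FaultInjector:
    def __init__(self, config, rng=None, sleep=time.sleep):
        self.config = config
        self._rng = rng or random.Random()
        self._sleep = sleep

    def wrap(self, handler):
        """Return a handler that applies the current fault config on every call."""
        def wrapped(*args, **kwargs):
            cfg = self.config  # read at call time so revert() takes effect immediately
            if cfg.enabled:
                if cfg.latency_ms > 0:
                    self._sleep(cfg.latency_ms / 1000.0)
                if self._rng.random() < cfg.error_rate:
                    raise InjectedFault("injected error")
            return handler(*args, **kwargs)
        return wrapped

    def revert(self):
        """Restore normal behavior by replacing the config with a disabled one."""
        self.config = FaultConfig()
```

Because the wrapper reads the config on each call, stopping an experiment only requires swapping the config; the service itself does not need a restart.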
Request Flow
1. User logs into Chaos Control Service UI with secure credentials.
2. User creates a chaos experiment specifying target microservices and failure types.
3. Scheduler triggers the experiment at the scheduled time.
4. Chaos Control Service sends commands to Failure Injection Agents in targeted microservices.
5. Agents inject faults (e.g., add latency, return errors) as instructed.
6. Monitoring system collects metrics and logs from microservices during the experiment.
7. Chaos Control Service displays real-time impact on dashboard.
8. User can stop or roll back the experiment via the control service.
9. Agents revert injected faults to restore normal service behavior.
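The flow above implies that an experiment moves through a small set of states. The sketch below models that lifecycle as an explicit state machine so that invalid transitions, such as restarting a stopped experiment, are rejected. The status names and the transition table are assumptions; the exercise only specifies a `status` field.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    STOPPED = "stopped"      # ended early by a user or an automatic rollback
    COMPLETED = "completed"  # ran to its planned end


# Allowed transitions (assumed): terminal states have no outgoing edges.
ALLOWED = {
    Status.DRAFT: {Status.SCHEDULED},
    Status.SCHEDULED: {Status.RUNNING, Status.STOPPED},
    Status.RUNNING: {Status.STOPPED, Status.COMPLETED},
    Status.STOPPED: set(),
    Status.COMPLETED: set(),
}


class Experiment:
    def __init__(self, name, targets):
        self.name = name
        self.targets = list(targets)
        self.status = Status.DRAFT
        self.history = [Status.DRAFT]

    def transition(self, new_status):
        """Move to new_status, or raise ValueError if the move is not allowed."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"cannot go from {self.status.value} to {new_status.value}"
            )
        self.status = new_status
        self.history.append(new_status)
```

Keeping the history alongside the current status gives the dashboard and audit log a simple record of what happened to each experiment.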
Database Schema
Entities:
- User (id, username, password_hash, role)
- Experiment (id, name, description, target_services, failure_types, schedule, status, created_at, updated_at)
- Service (id, name, version, metadata)
- ExperimentLog (id, experiment_id, timestamp, metric_name, metric_value)
Relationships:
- User creates many Experiments (1:N)
- Experiment targets many Services (N:N)
- Experiment has many ExperimentLogs (1:N)
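The entities and relationships can be sketched as relational tables using Python's built-in `sqlite3` module. Two details here are assumptions rather than part of the exercise: the `created_by` column, added to express the User-to-Experiment relationship, and the `experiment_target` join table, which stands in for the `target_services` field to model the N:N relationship. Column types are illustrative.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,
    password_hash TEXT NOT NULL,
    role TEXT NOT NULL
);
CREATE TABLE service (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT,
    metadata TEXT
);
CREATE TABLE experiment (
    id INTEGER PRIMARY KEY,
    created_by INTEGER NOT NULL REFERENCES user(id),  -- assumed column (User 1:N Experiment)
    name TEXT NOT NULL,
    description TEXT,
    failure_types TEXT,
    schedule TEXT,
    status TEXT NOT NULL DEFAULT 'draft',
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE experiment_target (  -- assumed join table (Experiment N:N Service)
    experiment_id INTEGER NOT NULL REFERENCES experiment(id),
    service_id INTEGER NOT NULL REFERENCES service(id),
    PRIMARY KEY (experiment_id, service_id)
);
CREATE TABLE experiment_log (
    id INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiment(id),
    timestamp TEXT NOT NULL,
    metric_name TEXT NOT NULL,
    metric_value REAL NOT NULL
);
"""


def create_db(path=":memory:"):
    """Create the schema and return a connection with foreign keys enforced."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

High-volume metric samples would more likely live in the monitoring backend; the `experiment_log` table here is only meant to mirror the ExperimentLog entity.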
Scaling Discussion
Bottlenecks
Chaos Control Service becomes overloaded with many concurrent experiments
Failure Injection Agents add latency overhead to microservices
Monitoring system overwhelmed by high volume of metrics during experiments
Security risks if unauthorized users access chaos controls
Rollback delays causing prolonged system instability
Solutions
Scale Chaos Control Service horizontally with load balancing and stateless design
Optimize agents to minimize latency; use sampling or throttling of fault injection
Use scalable monitoring backends like Thanos or Cortex for Prometheus metrics
Implement strict RBAC and audit logging for experiment control access
Automate rollback triggers based on anomaly detection to reduce manual delays
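As an illustration of an automated rollback trigger, the sketch below watches a sliding window of request outcomes and signals a rollback once the error rate crosses a fixed limit. This is a plain threshold check standing in for real anomaly detection. The class name, window size, and limits are assumptions.

```python
from collections import deque


class RollbackGuard:
    """Signals a rollback when the recent error rate breaches a limit."""

    def __init__(self, max_error_rate=0.05, window=100, min_samples=20):
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self._outcomes = deque(maxlen=window)  # True means the request failed

    def record(self, failed):
        """Add one request outcome; the oldest one drops out when the window is full."""
        self._outcomes.append(bool(failed))

    def error_rate(self):
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_rollback(self):
        # Wait for enough samples so a single early failure does not trigger a rollback.
        if len(self._outcomes) < self.min_samples:
            return False
        return self.error_rate() > self.max_error_rate
```

The control service could poll `should_rollback()` during an experiment and instruct the agents to revert their faults as soon as it returns true.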
Interview Tips
Time (45 minutes total): Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing components and data flow, 10 minutes discussing scaling and security, and 5 minutes summarizing.
Explain importance of safe, reversible chaos experiments in production
Describe how failure injection agents work and integrate with microservices
Highlight monitoring and metrics collection for impact analysis
Discuss security controls to prevent misuse
Address scaling challenges and solutions for large microservice environments