Design: Chaos Engineering Platform for Microservices
In scope: the chaos engineering platform's components and their integration with microservices. Out of scope: detailed microservice implementation and business logic.
Functional Requirements
FR1: Inject failures like latency, errors, and service crashes into microservices
FR2: Monitor system behavior and detect resilience issues automatically
FR3: Support scheduling and targeting specific microservices or endpoints
FR4: Provide dashboards for real-time metrics and failure impact visualization
FR5: Allow safe rollback and stop of experiments to avoid production damage
Non-Functional Requirements
NFR1: Handle up to 100 microservices in production
NFR2: Minimal impact on normal system latency (added p99 latency under 200 ms)
NFR3: Availability target 99.9% uptime for the chaos platform itself
NFR4: Experiments must be isolated and reversible
NFR5: Secure access control for who can run chaos experiments
Think Before You Design
Key Components
Failure injection agents or sidecars in microservices
Central chaos control service with API and UI
Metrics collection and monitoring system
Experiment scheduler and rollback mechanism
Authentication and authorization for experiment control
Design Patterns
Circuit breaker pattern for resilience
Bulkhead isolation to limit failure blast radius
Canary deployments for safe testing
Event-driven monitoring and alerting
Feature flags to enable/disable chaos experiments
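Of the patterns listed above, the circuit breaker is the one chaos experiments most often exercise directly, so a small illustration may help. The following is a minimal sketch, not a production implementation: the class name, the threshold and timeout defaults, and the injectable `clock` parameter are all assumptions made for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative breaker: closed -> open after N failures -> half-open after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the behaviour can be tested without waiting
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # Trip on reaching the threshold, or at once if a half-open trial call fails.
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            raise
        # A success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

An error-injection experiment against a downstream dependency should drive a breaker like this into the open state; if callers keep hammering the failing dependency instead, the experiment has found a resilience gap.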
Reference Architecture
             +-----------------------+
             | Chaos Control Service |
             | - API & UI            |
             | - Scheduler           |
             | - Auth & Rollback     |
             +-----------+-----------+
                         |
                         v
+------------------------+------------------------+
| Failure Injection Agents (sidecars) in each     |
| microservice instance                           |
+------------------------+------------------------+
                         |
                         v
             +-----------------------+
             | Monitoring & Metrics  |
             | - Collect logs        |
             | - Track latency       |
             | - Alerting system     |
             +-----------------------+
Components
Chaos Control Service
  Technology: Node.js/Go REST API with React UI
  Purpose: Central interface to create, schedule, monitor, and roll back chaos experiments
Failure Injection Agents
  Technology: Sidecar containers or middleware in microservices
  Purpose: Inject faults like latency, errors, or crashes into targeted microservices
Monitoring & Metrics System
  Technology: Prometheus + Grafana + Alertmanager
  Purpose: Collect metrics and logs to observe system behavior during chaos experiments
Authentication & Authorization
  Technology: OAuth2 / RBAC
  Purpose: Control who can run or stop chaos experiments to ensure security
Experiment Scheduler
  Technology: Cron jobs or Kubernetes CronJobs
  Purpose: Automate running chaos experiments at defined times or intervals
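To make the failure injection agent more concrete, here is a minimal sketch of the middleware variant. It assumes handlers take a request and return a `(status, body)` tuple; `FaultSpec`, `InjectionAgent`, and every field name are hypothetical and not taken from any real chaos tool.

```python
import random
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

Response = Tuple[int, str]
Handler = Callable[[Any], Response]


@dataclass(frozen=True)
class FaultSpec:
    """Hypothetical fault description pushed by the control service."""
    target_endpoint: str
    kind: str                    # "latency" or "error"
    probability: float = 1.0     # fraction of matching requests to affect
    latency_seconds: float = 0.0
    error_status: int = 503


class InjectionAgent:
    """Wraps request handlers and applies the currently active fault, if any."""

    def __init__(self, rng: Optional[random.Random] = None,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._fault: Optional[FaultSpec] = None
        self._rng = rng or random.Random()
        self._sleep = sleep  # injectable so tests do not actually wait

    def apply(self, fault: FaultSpec) -> None:
        self._fault = fault

    def clear(self) -> None:
        # Rollback: later requests pass through untouched.
        self._fault = None

    def wrap(self, endpoint: str, handler: Handler) -> Handler:
        def wrapped(request: Any) -> Response:
            fault = self._fault
            if (fault is not None and fault.target_endpoint == endpoint
                    and self._rng.random() < fault.probability):
                if fault.kind == "latency":
                    self._sleep(fault.latency_seconds)  # delay, then serve normally
                elif fault.kind == "error":
                    return fault.error_status, "injected fault"
            return handler(request)
        return wrapped
```

Because the wrapper reads the active fault on every request, calling `clear()` takes effect immediately, which is one simple way to get the reversibility asked for in NFR4.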
Request Flow
1. User logs into Chaos Control Service UI with secure credentials.
2. User creates a chaos experiment specifying target microservices and failure types.
3. Scheduler triggers the experiment at the scheduled time.
4. Chaos Control Service sends commands to Failure Injection Agents in targeted microservices.
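The flow above can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's actual API: `ControlService`, `Experiment`, the `Agent` interface, the allow-list standing in for RBAC, and the `max_error_rate` guardrail (added here to reflect FR5) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Set


class Agent(Protocol):
    """Minimal interface this sketch assumes each injection agent exposes."""
    def apply(self, fault: str) -> None: ...
    def clear(self) -> None: ...


@dataclass
class Experiment:
    name: str
    targets: List[str]
    fault: str
    max_error_rate: float = 0.05  # hypothetical abort guardrail


class ControlService:
    def __init__(self, agents: Dict[str, Agent], allowed_users: Set[str],
                 error_rate: Callable[[], float]) -> None:
        self.agents = agents
        self.allowed_users = allowed_users  # stand-in for a real RBAC check
        self.error_rate = error_rate        # stand-in for a metrics query

    def run(self, user: str, exp: Experiment, checks: int = 3) -> str:
        # Steps 1-2: only authorised users may start an experiment.
        if user not in self.allowed_users:
            raise PermissionError(f"{user} may not run chaos experiments")
        active: List[str] = []
        try:
            # Step 4: send the fault to each targeted agent.
            for target in exp.targets:
                self.agents[target].apply(exp.fault)
                active.append(target)
            # Guardrail: stop early if the observed error rate is too high.
            for _ in range(checks):
                if self.error_rate() > exp.max_error_rate:
                    return "aborted"
            return "completed"
        finally:
            # Roll back on every exit path, including exceptions.
            for target in active:
                self.agents[target].clear()
```

In a real deployment `run` would be invoked by the scheduler (step 3) rather than called directly, and the metrics check would poll the monitoring system over the experiment's duration.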
2. Which of the following is a correct way to start chaos engineering experiments?
easy
A. Start with complex multi-service failures immediately
B. Begin with simple, controlled failure tests
C. Run chaos tests only after a system crash
D. Avoid monitoring during chaos experiments
Solution
Step 1: Review best practice for chaos experiments
Best practice is to start small with simple, controlled failures to understand system behavior.
Step 2: Identify the correct starting approach
Starting with simple tests helps safely learn and improve system resilience gradually.
Final Answer:
Begin with simple, controlled failure tests -> Option B
Quick Check:
Start chaos with simple tests = Begin with simple, controlled failure tests [OK]
Hint: Start chaos tests simple and controlled, not complex [OK]
Common Mistakes:
Starting with complex failures too soon
Running chaos only after failures happen
Ignoring monitoring during tests
3. Consider a microservice system where a chaos experiment randomly kills one instance every 5 minutes. What is the expected immediate effect on system availability?
medium
A. System availability remains stable if redundancy exists
B. System availability drops to zero immediately
C. System crashes permanently after first kill
D. System automatically scales down instances
Solution
Step 1: Analyze the chaos experiment impact
Killing one instance every 5 minutes tests resilience but does not remove all instances.
Step 2: Consider system redundancy
If the system has redundant instances, killing one does not reduce availability immediately.
Final Answer:
System availability remains stable if redundancy exists -> Option A
Quick Check:
Redundancy keeps availability stable during chaos [OK]
Hint: Redundancy keeps system available despite instance failures [OK]
Common Mistakes:
Assuming system crashes immediately after one instance killed
Thinking availability drops to zero instantly
Believing system scales down automatically
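The reasoning in this answer can be checked with a toy simulation. The sketch below makes several simplifying assumptions that are not in the question: time advances in whole minutes, a killed instance is back after `restart_min` minutes, victims are chosen round-robin rather than randomly (to keep the result reproducible), and the service counts as available whenever at least one instance is healthy.

```python
def simulate_availability(instances: int, kill_interval_min: int = 5,
                          restart_min: int = 2, duration_min: int = 60) -> float:
    """Return the fraction of minutes with at least one healthy instance."""
    # Minute at which each instance becomes healthy again (0 = healthy from the start).
    down_until = [0] * instances
    next_victim = 0
    up_minutes = 0
    for minute in range(duration_min):
        # Kill one instance at every interval boundary after the start.
        if minute > 0 and minute % kill_interval_min == 0:
            down_until[next_victim] = minute + restart_min
            next_victim = (next_victim + 1) % instances
        if any(minute >= recovered_at for recovered_at in down_until):
            up_minutes += 1
    return up_minutes / duration_min


if __name__ == "__main__":
    print(simulate_availability(instances=3))  # redundant deployment
    print(simulate_availability(instances=1))  # single instance
```

With three instances the service never loses its last healthy instance under these assumptions, while a single-instance deployment is unavailable for every restart window.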
4. A chaos experiment script intended to shut down a microservice instance sometimes fails silently without stopping the instance. What is the most likely cause?
medium
A. The network is too fast for the script
B. The microservice is designed to never stop
C. The chaos experiment is running on a different system
D. The script lacks proper error handling and logging
Solution
Step 1: Identify why script fails silently
Silent failures usually happen when errors are not caught or logged properly.
Step 2: Evaluate other options
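Microservices can be stopped, so option B is wrong. Network speed does not cause a silent failure, so option A is wrong. Running the experiment against a different system would normally surface an error rather than fail silently, so option C is unlikely.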
Final Answer:
The script lacks proper error handling and logging -> Option D
Quick Check:
Silent failure = Missing error handling [OK]
Hint: Check error handling if chaos script fails silently [OK]
Common Mistakes:
Assuming microservice cannot be stopped
Blaming network speed for silent failure
Ignoring script environment mismatch
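A kill script avoids silent failure by logging every step and verifying that the instance actually stopped. The sketch below assumes the caller supplies two hypothetical callables, `stop` and `is_running`, that wrap whatever orchestration API is in use; neither name refers to a real library.

```python
import logging
import time
from typing import Callable

log = logging.getLogger("chaos.kill")


class KillFailed(RuntimeError):
    """Raised when the instance is still running after all verification checks."""


def kill_instance(instance_id: str,
                  stop: Callable[[str], None],
                  is_running: Callable[[str], bool],
                  retries: int = 3,
                  wait_seconds: float = 1.0,
                  sleep: Callable[[float], None] = time.sleep) -> None:
    """Stop an instance and verify the result instead of assuming success."""
    try:
        stop(instance_id)
    except Exception:
        # Log with traceback, then re-raise so the failure is never swallowed.
        log.exception("stop command failed for %s", instance_id)
        raise
    for attempt in range(1, retries + 1):
        if not is_running(instance_id):
            log.info("instance %s stopped (check %d)", instance_id, attempt)
            return
        log.warning("instance %s still running (check %d/%d)",
                    instance_id, attempt, retries)
        sleep(wait_seconds)
    raise KillFailed(f"instance {instance_id} still running after {retries} checks")
```

The key design choice is that the function either returns after confirming the stop or raises; there is no path on which it returns quietly while the instance is still up.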
5. You want to design a chaos engineering experiment to test how your microservices handle database latency spikes. Which approach best fits this goal?
hard
A. Inject artificial latency into database calls during tests
B. Disable monitoring tools to avoid false alerts
C. Increase the number of database replicas without testing
D. Randomly kill microservice instances during peak hours
Solution
Step 1: Understand the goal of testing database latency spikes
The goal is to see how microservices behave when database responses are slow.