Design: Chaos Engineering Platform for Microservices
Scope: the components of the chaos engineering platform and how they integrate with microservices. Out of scope: detailed microservice implementation and business logic.
Functional Requirements
FR1: Inject failures such as latency, errors, and service crashes into microservices
FR2: Monitor system behavior and detect resilience issues automatically
FR3: Support scheduling experiments and targeting specific microservices or endpoints
FR4: Provide dashboards for real-time metrics and failure impact visualization
FR5: Allow experiments to be stopped and rolled back safely to avoid damage in production
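To make FR1, FR3, and FR5 concrete, here is a minimal sketch of one way a fault injector could wrap a service handler. Everything named here (Experiment, InjectedFault, with_chaos, the "orders" service and "/checkout" endpoint) is hypothetical and chosen for illustration; the requirements above do not prescribe an implementation, and crash injection is left out for brevity.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Experiment:
    """Hypothetical experiment spec: what to inject and where (FR1, FR3)."""
    target_service: str
    target_endpoint: Optional[str] = None  # None targets every endpoint
    latency_ms: int = 0                    # extra delay added before the call
    error_rate: float = 0.0                # fraction of calls that fail, 0.0 to 1.0
    active: bool = True                    # cleared by stop() (FR5)

    def matches(self, service: str, endpoint: str) -> bool:
        """Return True if this experiment is active and targets the call."""
        if not self.active or service != self.target_service:
            return False
        return self.target_endpoint in (None, endpoint)

    def stop(self) -> None:
        """Stop injecting; later calls pass through untouched."""
        self.active = False


class InjectedFault(Exception):
    """Raised in place of the real response when an error is injected."""


def with_chaos(handler: Callable, service: str, endpoint: str,
               experiment: Experiment,
               rng: Optional[random.Random] = None,
               sleep: Callable[[float], None] = time.sleep) -> Callable:
    """Wrap a handler so that matching calls are delayed and/or failed."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if experiment.matches(service, endpoint):
            if experiment.latency_ms > 0:
                sleep(experiment.latency_ms / 1000.0)  # latency injection
            if rng.random() < experiment.error_rate:
                raise InjectedFault(f"injected error for {service}{endpoint}")
        return handler(*args, **kwargs)

    return wrapped
```

In this sketch, stopping an experiment only flips a flag that the wrapper checks on every call, so rollback needs no redeploy of the target service. The sleep and rng parameters are injectable so the behavior can be tested without real delays or randomness.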
Non-Functional Requirements
NFR1: Handle up to 100 microservices in production
NFR2: Minimal impact on normal system latency (p99 overhead < 200 ms)
NFR3: Availability target 99.9% uptime for the chaos platform itself
NFR4: Experiments must be isolated and reversible
NFR5: Secure access control over who can run chaos experiments
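Two of the requirements above lend themselves to simple programmatic checks. The sketch below shows one possible reading of NFR2 (compare p99 latency with and without the platform and flag an overhead of 200 ms or more) and of NFR5 (a role check before an experiment may start). The role names, function names, and the nearest-rank percentile method are assumptions made for illustration, not part of the requirements.

```python
from typing import Iterable, Sequence

LATENCY_BUDGET_MS = 200                 # from NFR2
ALLOWED_ROLES = {"chaos-admin", "sre"}  # hypothetical role names


def p99(samples_ms: Sequence[float]) -> float:
    """Return the 99th percentile using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.99 * n), which avoids float rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def exceeds_latency_budget(baseline_ms: Sequence[float],
                           observed_ms: Sequence[float],
                           budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """Return True if p99 overhead against the baseline reaches the budget."""
    overhead = p99(observed_ms) - p99(baseline_ms)
    return overhead >= budget_ms


def may_run_experiments(user_roles: Iterable[str]) -> bool:
    """Return True if the user holds at least one allowed role (NFR5)."""
    return bool(set(user_roles) & ALLOWED_ROLES)
```

A check like exceeds_latency_budget could also serve as an automatic abort condition for a running experiment, though whether the 200 ms budget applies to the idle platform, to active experiments, or to both is a decision the requirements leave open.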