Chaos engineering basics in Microservices - System Design Guide

Problem Statement
Unexpected failures in complex microservices systems cause outages and degrade user experience. Without proactive testing, these failures remain hidden until they cause serious damage, making recovery slow and unpredictable.
Solution
Chaos engineering introduces controlled, deliberate failures into a system to observe how it behaves under stress. By simulating outages and faults in production-like environments, teams identify weaknesses and improve system resilience before real incidents occur.
Architecture
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│   Chaos Tool  │──────▶│ Microservices │──────▶│ Monitoring &  │
│ (Inject Fault)│       │   System      │       │   Alerting    │
└───────────────┘       └───────────────┘       └───────────────┘
         │                      │                      ▲
         │                      │                      │
         └──────────────────────┴──────────────────────┘

This diagram shows a chaos engineering tool injecting faults into a microservices system, which is monitored to detect failures and trigger alerts.
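The loop in the diagram can be sketched as a small experiment runner: check that the system is healthy, inject one fault, observe what monitoring reports, and always roll the fault back. This is a minimal illustration under assumptions, not a real chaos tool; `run_experiment`, `ExperimentResult`, and the two-replica toy system are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExperimentResult:
    hypothesis_held: bool
    alerts: List[str] = field(default_factory=list)


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    steady_state_ok: Callable[[], bool],
    collect_alerts: Callable[[], List[str]],
) -> ExperimentResult:
    """Run one chaos experiment: verify baseline, inject, observe, roll back."""
    # Never inject into a system that is already unhealthy.
    if not steady_state_ok():
        raise RuntimeError("System is not in steady state; aborting experiment")
    inject()
    try:
        # The hypothesis is that the steady state survives the fault.
        held = steady_state_ok()
        alerts = collect_alerts()
    finally:
        # Always undo the fault, even if observation itself fails.
        rollback()
    return ExperimentResult(hypothesis_held=held, alerts=alerts)


# Toy system (hypothetical): two replicas, healthy while at least one is up.
replicas = {"a": True, "b": True}

result = run_experiment(
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
    steady_state_ok=lambda: any(replicas.values()),
    collect_alerts=lambda: [f"replica {n} down" for n, up in replicas.items() if not up],
)
print(result)
```

Passing the inject, rollback, and observation steps in as callables keeps the runner independent of any particular fault type or monitoring stack.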

Trade-offs
✓ Pros
Reveals hidden system weaknesses before real failures occur.
Improves system reliability by validating recovery processes.
Builds confidence in system stability under unpredictable conditions.
✗ Cons
Requires careful planning to avoid causing real user impact.
Adds complexity to testing and monitoring infrastructure.
Needs cultural buy-in as it challenges traditional testing mindsets.
Use when operating a complex distributed system with many microservices where uptime and reliability are critical, typically at a scale of hundreds of services or more.
Avoid if the system is very simple, if it is in early development and basic functionality is not yet stable, or if monitoring and alerting are insufficient to detect and respond to injected failures.
Real World Examples
Netflix
Netflix uses Chaos Monkey to randomly terminate instances in production to ensure their streaming service can tolerate failures without user impact.
Amazon
Amazon employs chaos engineering to test the resilience of their retail platform by simulating failures in their distributed services during peak traffic.
LinkedIn
LinkedIn runs chaos experiments to validate their microservices' ability to recover from network partitions and service crashes.
Code Example
The before code calls the payment gateway directly, so its failure handling is never exercised. The after code raises a simulated failure on roughly 10% of calls, so that the callers' error handling and recovery can be tested.
### Before: No chaos engineering
# external_gateway is assumed to be a payment gateway client defined elsewhere.
class PaymentService:
    def process_payment(self, amount):
        # Directly calls the external payment gateway
        response = external_gateway.charge(amount)
        return response


### After: With chaos engineering fault injection
import random

class PaymentService:
    def process_payment(self, amount):
        # Inject a random failure to simulate a gateway outage
        if random.random() < 0.1:  # 10% failure rate
            raise ConnectionError("Simulated payment gateway failure")
        response = external_gateway.charge(amount)
        return response
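The fault injection above is hard-coded and always on, and nothing handles the simulated failure. One possible refinement, sketched here under assumptions, puts the injection behind an explicit switch and has the service retry and then degrade gracefully. `FakeGateway`, `FaultInjector`, `GatewayUnavailable`, and the `queued_for_retry` status are hypothetical names chosen for illustration.

```python
import random


class GatewayUnavailable(Exception):
    """Raised when the (real or simulated) payment gateway cannot be reached."""


class FakeGateway:
    """Stand-in for the external payment gateway (hypothetical)."""

    def charge(self, amount):
        return {"status": "charged", "amount": amount}


class FaultInjector:
    """Decides whether to inject a failure; off unless explicitly enabled."""

    def __init__(self, enabled=False, failure_rate=0.1, rng=None):
        self.enabled = enabled
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def maybe_fail(self):
        if self.enabled and self.rng.random() < self.failure_rate:
            raise GatewayUnavailable("Simulated payment gateway failure")


class PaymentService:
    def __init__(self, gateway, injector, max_attempts=3):
        self.gateway = gateway
        self.injector = injector
        self.max_attempts = max_attempts

    def process_payment(self, amount):
        # Retry a bounded number of times, then degrade instead of crashing.
        for _ in range(self.max_attempts):
            try:
                self.injector.maybe_fail()
                return self.gateway.charge(amount)
            except GatewayUnavailable:
                continue
        return {"status": "queued_for_retry", "amount": amount}


normal = PaymentService(FakeGateway(), FaultInjector(enabled=False))
outage = PaymentService(FakeGateway(), FaultInjector(enabled=True, failure_rate=1.0))
print(normal.process_payment(25))
print(outage.process_payment(25))
```

Retrying is safe in this sketch only because the simulated failure is raised before the charge is attempted. A real gateway can fail after charging, so real retries would need some form of idempotency protection.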
Alternatives
Load Testing
Focuses on testing system performance under high traffic rather than injecting failures to test resilience.
Use when: Choose load testing when the main concern is system capacity and throughput rather than fault tolerance.
Blue-Green Deployment
Switches traffic between two identical environments to reduce downtime, but does not actively test failure scenarios.
Use when: Choose blue-green deployment to minimize downtime during releases rather than to test system robustness.
Summary
Chaos engineering helps find hidden failures by injecting controlled faults into systems.
It improves reliability by testing how systems recover from unexpected problems.
This practice requires careful planning and monitoring to avoid real user impact.

Practice

1. What is the main goal of chaos engineering in microservices?
easy
A. To reduce the number of developers needed
B. To increase the number of microservices in a system
C. To find and fix weaknesses before real failures occur
D. To speed up the deployment process

Solution

  1. Step 1: Understand chaos engineering purpose

    Chaos engineering is about testing systems by intentionally causing failures to find weaknesses.
  2. Step 2: Identify the main goal

    The goal is to find and fix weaknesses before they cause real problems in production.
  3. Final Answer:

    To find and fix weaknesses before real failures occur -> Option C
  4. Quick Check:

    Chaos engineering goal = Find and fix weaknesses [OK]
Hint: Chaos engineering tests failures to improve system stability [OK]
Common Mistakes:
  • Thinking chaos engineering increases microservices count
  • Confusing chaos engineering with deployment speedup
  • Assuming chaos engineering reduces developer count
2. Which of the following is a correct way to start chaos engineering experiments?
easy
A. Start with complex multi-service failures immediately
B. Begin with simple, controlled failure tests
C. Run chaos tests only after a system crash
D. Avoid monitoring during chaos experiments

Solution

  1. Step 1: Review best practice for chaos experiments

    Best practice is to start small with simple, controlled failures to understand system behavior.
  2. Step 2: Identify the correct starting approach

    Starting with simple tests helps safely learn and improve system resilience gradually.
  3. Final Answer:

    Begin with simple, controlled failure tests -> Option B
  4. Quick Check:

    Start chaos with simple tests = Begin with simple, controlled failure tests [OK]
Hint: Start chaos tests simple and controlled, not complex [OK]
Common Mistakes:
  • Starting with complex failures too soon
  • Running chaos only after failures happen
  • Ignoring monitoring during tests
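As a rough illustration of "simple and controlled", a first experiment could be checked against a short preflight list before it runs. The sketch below assumes a hypothetical `ExperimentPlan` structure and an arbitrary 5% blast-radius limit; both are illustrative choices, not values from any standard.

```python
from dataclasses import dataclass


@dataclass
class ExperimentPlan:
    name: str
    fault_count: int          # number of distinct faults injected at once
    blast_radius_pct: float   # share of instances or traffic affected
    monitoring_enabled: bool
    has_abort_condition: bool


def preflight_problems(plan, max_blast_radius_pct=5.0):
    """Return the reasons a plan is not a safe first experiment (empty if none)."""
    problems = []
    if plan.fault_count != 1:
        problems.append("inject exactly one fault at a time")
    if plan.blast_radius_pct > max_blast_radius_pct:
        problems.append(f"blast radius above {max_blast_radius_pct}%")
    if not plan.monitoring_enabled:
        problems.append("monitoring must be on during the experiment")
    if not plan.has_abort_condition:
        problems.append("define an abort condition before starting")
    return problems


simple = ExperimentPlan("kill one cart instance", 1, 2.0, True, True)
risky = ExperimentPlan("multi-service outage", 4, 50.0, False, False)
print(preflight_problems(simple))
print(preflight_problems(risky))
```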
3. Consider a microservice system where a chaos experiment randomly kills one instance every 5 minutes. What is the expected immediate effect on system availability?
medium
A. System availability remains stable if redundancy exists
B. System availability drops to zero immediately
C. System crashes permanently after first kill
D. System automatically scales down instances

Solution

  1. Step 1: Analyze the chaos experiment impact

    Killing one instance every 5 minutes tests resilience but does not remove all instances.
  2. Step 2: Consider system redundancy

    If the system has redundant instances, killing one does not reduce availability immediately.
  3. Final Answer:

    System availability remains stable if redundancy exists -> Option A
  4. Quick Check:

    Redundancy keeps availability stable during chaos [OK]
Hint: Redundancy keeps system available despite instance failures [OK]
Common Mistakes:
  • Assuming system crashes immediately after one instance killed
  • Thinking availability drops to zero instantly
  • Believing system scales down automatically
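The reasoning in question 3 can be illustrated with a toy model: killing one instance leaves a redundant pool available, while the same kill is an outage for a single-instance service. `ServicePool` is a hypothetical stand-in for instances behind a load balancer, and `heal` imitates an orchestrator restoring the desired replica count.

```python
import random


class ServicePool:
    """Toy pool of service instances behind a load balancer (hypothetical)."""

    def __init__(self, size):
        self.healthy = {f"instance-{i}" for i in range(size)}
        self.desired = size
        self.replacements = 0

    def kill_random(self, rng):
        # Pick one healthy instance and remove it, as the chaos experiment would.
        victim = rng.choice(sorted(self.healthy))
        self.healthy.discard(victim)
        return victim

    def is_available(self):
        # The service can answer requests while at least one instance is up.
        return len(self.healthy) > 0

    def heal(self):
        # Stand-in for an orchestrator restoring the desired replica count.
        while len(self.healthy) < self.desired:
            self.replacements += 1
            self.healthy.add(f"replacement-{self.replacements}")


rng = random.Random(0)
redundant = ServicePool(size=3)
single = ServicePool(size=1)

redundant.kill_random(rng)
single.kill_random(rng)

print(redundant.is_available())  # True: two instances still serve traffic
print(single.is_available())     # False: with no redundancy, the kill is an outage
```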
4. A chaos experiment script intended to shut down a microservice instance sometimes fails silently without stopping the instance. What is the most likely cause?
medium
A. The network is too fast for the script
B. The microservice is designed to never stop
C. The chaos experiment is running on a different system
D. The script lacks proper error handling and logging

Solution

  1. Step 1: Identify why script fails silently

    Silent failures usually happen when errors are not caught or logged properly.
  2. Step 2: Evaluate other options

    Microservices can be stopped; network speed does not cause silent failures; and running on a different system would produce errors rather than a silent failure.
  3. Final Answer:

    The script lacks proper error handling and logging -> Option D
  4. Quick Check:

    Silent failure = Missing error handling [OK]
Hint: Check error handling if chaos script fails silently [OK]
Common Mistakes:
  • Assuming microservice cannot be stopped
  • Blaming network speed for silent failure
  • Ignoring script environment mismatch
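One way to avoid the silent failure described in question 4 is to log every step and verify that the instance actually stopped. The sketch below assumes the platform-specific stop command and health check are passed in as callables; `stop_instance`, `StopFailed`, and the `chaos` logger name are hypothetical.

```python
import logging

logger = logging.getLogger("chaos")


class StopFailed(Exception):
    """Raised when an instance could not be confirmed as stopped."""


def stop_instance(instance_id, stop, is_running):
    """Stop an instance and verify the result.

    stop and is_running are injected callables that wrap whatever the
    platform provides for stopping and checking an instance.
    """
    logger.info("Stopping %s", instance_id)
    try:
        stop(instance_id)
    except Exception as exc:
        # Log the traceback and surface the failure instead of swallowing it.
        logger.exception("Stop command for %s raised an error", instance_id)
        raise StopFailed(instance_id) from exc
    # A stop command that returns normally is not proof the instance is down.
    if is_running(instance_id):
        logger.error("%s is still running after the stop command", instance_id)
        raise StopFailed(instance_id)
    logger.info("%s stopped", instance_id)
```

With this shape, a stop command that returns without doing anything is reported as `StopFailed` instead of passing unnoticed.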
5. You want to design a chaos engineering experiment to test how your microservices handle database latency spikes. Which approach best fits this goal?
hard
A. Inject artificial latency into database calls during tests
B. Disable monitoring tools to avoid false alerts
C. Increase the number of database replicas without testing
D. Randomly kill microservice instances during peak hours

Solution

  1. Step 1: Understand the goal of testing database latency spikes

    The goal is to see how microservices behave when database responses are slow.
  2. Step 2: Choose the best chaos experiment approach

    Injecting artificial latency simulates slow database calls directly, matching the goal.
  3. Step 3: Evaluate other options

    Killing instances tests availability, not latency; increasing replicas without testing doesn't simulate latency; disabling monitoring hides important data.
  4. Final Answer:

    Inject artificial latency into database calls during tests -> Option A
  5. Quick Check:

    Test latency by injecting delays = Inject artificial latency into database calls during tests [OK]
Hint: Inject delays to test latency, not kill instances [OK]
Common Mistakes:
  • Confusing instance failure with latency testing
  • Adding replicas without testing effects
  • Turning off monitoring during chaos
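The latency injection from question 5 can be sketched as a wrapper that delays each database call by a fixed amount. This is a minimal illustration, assuming the database access is an ordinary function; `with_injected_latency`, `timed`, and `fetch_user` are hypothetical names, and `fetch_user` stands in for a real query.

```python
import time


def with_injected_latency(db_call, delay_seconds, sleep=time.sleep):
    """Wrap a database call so each invocation is delayed by delay_seconds."""
    def wrapper(*args, **kwargs):
        sleep(delay_seconds)
        return db_call(*args, **kwargs)
    return wrapper


def timed(call, *args, clock=time.monotonic):
    """Run a call and return its result together with the elapsed seconds."""
    start = clock()
    result = call(*args)
    return result, clock() - start


def fetch_user(user_id):
    # Stand-in for a real database query (hypothetical).
    return {"id": user_id, "name": "example"}


# Simulate a 50 ms latency spike on every call and observe the effect.
slow_fetch_user = with_injected_latency(fetch_user, delay_seconds=0.05)
user, elapsed = timed(slow_fetch_user, 42)
print(user, f"{elapsed:.3f}s")
```

The `sleep` parameter is injectable so that the wrapper can be tested without real delays.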