Rollback strategies in Microservices - System Design Guide

Problem Statement
When a new version of a microservice is deployed and contains bugs or causes failures, the system can become unstable or unusable. Without a clear rollback plan, fixing these issues can take a long time, causing downtime and loss of user trust.
Solution
Rollback strategies provide a controlled way to revert a microservice to a previous stable version quickly. This is done by keeping previous versions ready and switching traffic back to them when problems arise, minimizing downtime and impact on users.
Architecture
[Diagram components: User Request, API Gateway, Deployment Tool]

This diagram shows how user requests flow through an API Gateway to the active microservice version. The Deployment Tool manages microservice versions and can switch traffic back to a previous version to rollback.
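
The routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real gateway: `ApiGateway`, `DeploymentTool`, and the two handler functions are hypothetical names invented for this example.

```python
from typing import Callable, Dict, List

Handler = Callable[[str], str]


class ApiGateway:
    """Forwards every request to whichever version is currently active."""

    def __init__(self, versions: Dict[str, Handler], active: str) -> None:
        self.versions = versions
        self.active = active

    def route(self, request: str) -> str:
        return self.versions[self.active](request)


class DeploymentTool:
    """Tracks deployment history and repoints the gateway on rollback."""

    def __init__(self, gateway: ApiGateway) -> None:
        self.gateway = gateway
        self.history: List[str] = [gateway.active]

    def deploy(self, version: str, handler: Handler) -> None:
        self.gateway.versions[version] = handler
        self.gateway.active = version
        self.history.append(version)

    def rollback(self) -> str:
        if len(self.history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self.history.pop()                      # drop the faulty version
        self.gateway.active = self.history[-1]  # previous stable version
        return self.gateway.active


def stable_handler(request: str) -> str:
    return "ok"


def faulty_handler(request: str) -> str:
    return "error"  # stands in for a buggy release


gateway = ApiGateway({"v1.0": stable_handler}, active="v1.0")
tool = DeploymentTool(gateway)
tool.deploy("v2.0", faulty_handler)
after_deploy = gateway.route("request")
tool.rollback()
after_rollback = gateway.route("request")
print(after_deploy, after_rollback)
```

The old handler stays registered in the gateway after a deploy, which is what makes the rollback a pointer change rather than a redeployment.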

Trade-offs
✓ Pros
Minimizes downtime by quickly reverting to a known stable version.
Reduces risk of prolonged outages caused by faulty deployments.
Supports continuous delivery by enabling safe experimentation.
Improves user experience by maintaining service availability.
✗ Cons
Requires maintaining multiple versions and extra storage.
Rollback may cause data inconsistencies if schema changes are involved.
Complexity increases with dependencies between microservices.
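
The schema risk listed above is easier to see with a concrete case. In this sketch the record layout and function names are hypothetical: v2.0 stores a name as two fields, while v1.0 only understands a single `name` field, so data written by v2.0 is unreadable once the code is rolled back.

```python
# Records written by each version; field names are illustrative only.
def v1_write(name: str) -> dict:
    return {"name": name}


def v2_write(first: str, last: str) -> dict:
    # v2.0 changed the schema: 'name' was split into two fields.
    return {"first_name": first, "last_name": last}


def v1_read(record: dict) -> str:
    # v1.0 code only knows the old schema.
    return record["name"]


# The store holds one record from each version at the moment of rollback.
store = [v1_write("Ada Lovelace"), v2_write("Grace", "Hopper")]

readable, unreadable = [], []
for record in store:
    try:
        readable.append(v1_read(record))
    except KeyError:
        unreadable.append(record)

print(readable)
print(unreadable)
```

Rolling back the code does not roll back the data, which is why schema changes usually need their own compatibility or migration plan.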
Use rollback strategies for production microservices that release frequently and where uptime is critical, typically at traffic levels of hundreds of requests per second or more.
Rollback strategies may not be worth the effort in very simple systems with infrequent deployments, such as small internal tools with low traffic, where the cost of maintaining multiple versions outweighs the benefits.
Real World Examples
Netflix
Netflix uses automated rollback strategies to revert microservice versions quickly when new deployments cause errors, helping keep streaming uninterrupted.
Uber
Uber employs rollback strategies to quickly switch back to previous microservice versions during incidents, minimizing impact on ride requests.
Amazon
Amazon uses rollback mechanisms in their deployment pipelines to maintain high availability of their e-commerce services during frequent updates.
Code Example
Before applying rollback strategies, the microservice runs a buggy version without a way to revert quickly. After applying rollback, the service keeps multiple versions and can switch the active version back to a stable one instantly, minimizing downtime.
### Before rollback strategy (naive deployment)
class Microservice:
    def __init__(self):
        self.version = 'v2.0'

    def handle_request(self, request):
        if self.version == 'v2.0':
            # buggy code
            return 'error'
        else:
            return 'ok'

service = Microservice()
print(service.handle_request('request'))  # prints 'error'


### After rollback strategy applied
class Microservice:
    def __init__(self):
        self.active_version = 'v2.0'
        self.versions = {
            'v1.0': self.v1_0_handler,
            'v2.0': self.v2_0_handler
        }

    def v1_0_handler(self, request):
        return 'ok'

    def v2_0_handler(self, request):
        # buggy code
        return 'error'

    def handle_request(self, request):
        return self.versions[self.active_version](request)

    def rollback(self):
        self.active_version = 'v1.0'

service = Microservice()
print(service.handle_request('request'))  # prints 'error' (v2.0 is active)
service.rollback()                        # switch the active version back to v1.0
print(service.handle_request('request'))  # prints 'ok'
Alternatives
Blue-Green Deployment
Deploys the new version alongside the old one and switches traffic atomically, avoiding downtime. The old environment stays in place, so rolling back means switching traffic back to it.
Use when: you want zero-downtime deployments and can afford double infrastructure temporarily.
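
A minimal sketch of the traffic switch, assuming two in-process handlers stand in for the two environments. `BlueGreenRouter` and its method names are hypothetical; a real setup would flip a load balancer or DNS target instead.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


class BlueGreenRouter:
    """Holds two environments and sends all traffic to the live one."""

    def __init__(self, blue: Handler, green: Handler) -> None:
        self.environments: Dict[str, Handler] = {"blue": blue, "green": green}
        self.live = "blue"

    def handle(self, request: str) -> str:
        return self.environments[self.live](request)

    def switch(self) -> str:
        # A single assignment flips all traffic; the other environment
        # stays deployed, so switching again acts as the rollback.
        self.live = "green" if self.live == "blue" else "blue"
        return self.live


router = BlueGreenRouter(blue=lambda r: "ok", green=lambda r: "error")
before = router.handle("request")   # served by blue (old version)
router.switch()                     # cut over to green (new version)
during = router.handle("request")
router.switch()                     # rollback: cut back to blue
after = router.handle("request")
print(before, during, after)
```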
Canary Deployment
Gradually shifts traffic to the new version so issues are detected early, before full rollout, which limits how many users a bad release can affect.
Use when: you want to test new versions on a small subset of users before full deployment.
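
One way to pick the canary subset is to hash each user into a fixed bucket, so the same user always sees the same version while the percentage is raised. The sketch below assumes string user IDs; `bucket` and `pick_version` are hypothetical helpers, and CRC32 is used only as a cheap stable hash.

```python
import zlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    return zlib.crc32(user_id.encode("utf-8")) % 100


def pick_version(user_id: str, canary_percent: int) -> str:
    """Send users whose bucket falls below the threshold to the canary."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


users = [f"user-{i}" for i in range(1000)]

# Raising the percentage only ever adds users to the canary group.
canary_at_10 = {u for u in users if pick_version(u, 10) == "canary"}
canary_at_50 = {u for u in users if pick_version(u, 50) == "canary"}
print(len(canary_at_10), len(canary_at_50))

# Rolling back a canary amounts to setting the percentage to 0.
canary_at_0 = {u for u in users if pick_version(u, 0) == "canary"}
```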
Feature Flags
Controls new features at runtime without redeploying, so a problematic feature can be switched off instead of rolling back the whole release.
Use when: you want fine-grained control over features and faster recovery from issues.
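
A minimal sketch of a runtime flag check, assuming an in-memory flag store. `FeatureFlags`, the `new_checkout` flag, and both checkout functions are hypothetical; production systems typically read flags from a shared configuration service.

```python
class FeatureFlags:
    """In-memory flag store used only for this illustration."""

    def __init__(self) -> None:
        self._flags = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)  # unknown flags default to off


flags = FeatureFlags()


def legacy_checkout(total: float) -> str:
    return f"legacy:{total:.2f}"


def new_checkout(total: float) -> str:
    return "error"  # stands in for buggy new code


def checkout(total: float) -> str:
    # The flag is read on every call, so flipping it takes effect at once.
    if flags.is_enabled("new_checkout"):
        return new_checkout(total)
    return legacy_checkout(total)


flags.set("new_checkout", True)
with_flag = checkout(10.0)
flags.set("new_checkout", False)    # disable the feature, no redeploy
without_flag = checkout(10.0)
print(with_flag, without_flag)
```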
Summary
Rollback strategies prevent prolonged outages by quickly reverting to stable microservice versions.
They require maintaining multiple versions and managing traffic routing between them.
Rollback is essential in production systems with frequent deployments and high availability needs.

Practice

1. What is the main purpose of a rollback strategy in microservices?
easy
A. To quickly undo a bad deployment and restore the previous stable state
B. To add new features to the system without downtime
C. To permanently delete old versions of services
D. To monitor system performance continuously

Solution

  1. Step 1: Understand rollback purpose

    Rollback strategies are designed to revert changes that cause issues, restoring stability.
  2. Step 2: Identify correct purpose in options

    Only option A, "To quickly undo a bad deployment and restore the previous stable state", describes reverting a faulty deployment; the other options describe feature delivery, cleanup, or monitoring.
  3. Final Answer:

    To quickly undo a bad deployment and restore the previous stable state -> Option A
  4. Quick Check:

    Rollback purpose = Undo bad deployment [OK]
Hint: Rollback means undo bad changes fast [OK]
Common Mistakes:
  • Confusing rollback with feature deployment
  • Thinking rollback deletes old versions permanently
  • Mixing rollback with monitoring
2. Which of the following is a correct description of the blue-green deployment rollback method?
easy
A. Switch traffic back to the old environment if the new one fails
B. Gradually increase traffic to the new version while monitoring
C. Manually fix database schema errors after deployment
D. Deploy new code directly to production without testing

Solution

  1. Step 1: Recall blue-green deployment basics

    Blue-green uses two identical environments: one is active and serves traffic, the other is idle and receives the new version.
  2. Step 2: Identify rollback action

    If the new version fails, traffic is switched back to the old environment immediately.
  3. Final Answer:

    Switch traffic back to the old environment if the new one fails -> Option A
  4. Quick Check:

    Blue-green rollback = Switch traffic back [OK]
Hint: Blue-green rollback switches traffic instantly [OK]
Common Mistakes:
  • Confusing blue-green with canary deployment
  • Thinking rollback fixes database manually
  • Ignoring traffic switching concept
3. Consider this simplified code snippet for a canary deployment rollback trigger:
if error_rate > 0.05:
    rollback_canary()

What happens when the error rate exceeds 5% during canary deployment?
medium
A. The system continues deployment without changes
B. The error rate is ignored and logged only
C. The rollback_canary function is called to revert changes
D. The deployment is paused but not rolled back

Solution

  1. Step 1: Analyze the condition in code

    The code checks if error_rate is greater than 0.05 (5%).
  2. Step 2: Understand the action on condition true

    If true, rollback_canary() is called to revert the canary deployment.
  3. Final Answer:

    The rollback_canary function is called to revert changes -> Option C
  4. Quick Check:

    Error rate > 5% triggers rollback [OK]
Hint: Error rate > threshold triggers rollback function [OK]
Common Mistakes:
  • Ignoring the rollback call in the code
  • Assuming deployment pauses without rollback
  • Confusing logging with rollback action
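
For reference, the trigger from question 3 can be expanded into a runnable sketch. `CanaryMonitor` and its counters are hypothetical; only the `error_rate > 0.05` check and the `rollback_canary` name come from the snippet in the question.

```python
ERROR_RATE_THRESHOLD = 0.05  # 5%, as in the snippet from question 3


class CanaryMonitor:
    """Counts canary requests and rolls back when errors exceed the threshold."""

    def __init__(self) -> None:
        self.requests = 0
        self.errors = 0
        self.rolled_back = False

    def record(self, success: bool) -> None:
        self.requests += 1
        if not success:
            self.errors += 1

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def rollback_canary(self) -> None:
        self.rolled_back = True  # a real system would shift traffic here

    def check(self) -> bool:
        if self.error_rate > ERROR_RATE_THRESHOLD:
            self.rollback_canary()
        return self.rolled_back


monitor = CanaryMonitor()
for i in range(100):
    monitor.record(success=(i >= 5))  # 5 failures out of 100 requests
at_threshold = monitor.check()        # exactly 5% does not exceed 5%

monitor.record(success=False)         # 6 failures out of 101 requests
above_threshold = monitor.check()
print(at_threshold, above_threshold)
```

Because the comparison is strict, a rate of exactly 5% does not trigger the rollback; only a rate above it does.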
4. A microservice deployment uses database migration with rollback scripts. The rollback script fails due to a syntax error. What is the best immediate action?
medium
A. Ignore the failure and continue deployment
B. Restart the service without rollback
C. Delete the database and start fresh
D. Manually fix the rollback script and retry rollback

Solution

  1. Step 1: Identify rollback script failure impact

    A syntax error in the rollback script prevents the migration changes from being undone safely.
  2. Step 2: Choose safe recovery action

    Fixing the script manually and retrying rollback ensures data integrity and system stability.
  3. Final Answer:

    Manually fix the rollback script and retry rollback -> Option D
  4. Quick Check:

    Fix rollback script error before retrying [OK]
Hint: Fix rollback script errors before retrying rollback [OK]
Common Mistakes:
  • Ignoring rollback failure and proceeding
  • Deleting database without backup
  • Restarting service without fixing rollback
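
The scenario in question 4 can be reproduced with an in-memory SQLite database. The table name and the three SQL scripts are hypothetical; the point is that a rollback script with a syntax error leaves the migrated schema in place until the script is fixed and rerun.

```python
import sqlite3

UP = "CREATE TABLE orders_v2 (id INTEGER PRIMARY KEY, total REAL)"
BROKEN_DOWN = "DROP TABEL orders_v2"  # typo: this is a syntax error
FIXED_DOWN = "DROP TABLE orders_v2"


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.execute(UP)  # forward migration

try:
    conn.execute(BROKEN_DOWN)
    rollback_failed = False
except sqlite3.OperationalError:
    rollback_failed = True  # the schema change is still in place

still_migrated = table_exists(conn, "orders_v2")

conn.execute(FIXED_DOWN)  # corrected script, retried
rolled_back = not table_exists(conn, "orders_v2")
conn.close()
print(rollback_failed, still_migrated, rolled_back)
```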
5. You have a microservices system using canary deployments with automated rollback on failure. Suddenly, a rollback triggers repeatedly due to a false positive error spike caused by monitoring noise. What is the best architectural improvement to reduce unnecessary rollbacks?
hard
A. Disable rollback automation and rely on manual checks
B. Implement a cooldown period before allowing another rollback
C. Remove monitoring to avoid false alarms
D. Rollback immediately on any error spike without delay

Solution

  1. Step 1: Understand problem cause

    False positive error spikes cause repeated rollbacks due to noisy monitoring data.
  2. Step 2: Identify architectural fix

    Adding a cooldown period prevents rapid repeated rollbacks, allowing noise to settle before next rollback.
  3. Final Answer:

    Implement a cooldown period before allowing another rollback -> Option B
  4. Quick Check:

    Cooldown period reduces rollback noise impact [OK]
Hint: Cooldown period prevents rollback storms from noise [OK]
Common Mistakes:
  • Disabling automation loses rollback benefits
  • Removing monitoring hides real issues
  • Rolling back immediately causes instability
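
A cooldown like the one in question 5 can be sketched as a small guard around the rollback trigger. `RollbackController` and the 300-second window are hypothetical choices; timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from typing import Optional


class RollbackController:
    """Allows at most one rollback per cooldown window."""

    def __init__(self, cooldown_seconds: float) -> None:
        self.cooldown_seconds = cooldown_seconds
        self.last_rollback_at: Optional[float] = None
        self.rollbacks = 0

    def on_error_spike(self, now: float) -> bool:
        """Return True if a rollback was performed for this spike."""
        in_cooldown = (
            self.last_rollback_at is not None
            and now - self.last_rollback_at < self.cooldown_seconds
        )
        if in_cooldown:
            return False  # spike ignored until the window has passed
        self.last_rollback_at = now
        self.rollbacks += 1
        return True


controller = RollbackController(cooldown_seconds=300)
# Error spikes reported at these times (in seconds).
results = [controller.on_error_spike(t) for t in (0, 30, 120, 400)]
print(results, controller.rollbacks)
```

The spikes at 30 and 120 seconds fall inside the window opened by the first rollback and are ignored; the spike at 400 seconds is outside it and triggers a second rollback.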