
Canary deployment in Microservices - System Design Guide

Problem Statement
Deploying a new version of a service to all users at once can cause widespread failures if the new version has bugs or performance issues. This can lead to downtime, loss of revenue, and damage to user trust.
Solution
Canary deployment solves this by releasing the new version to a small subset of users first. The system monitors this subset for errors or performance problems. If all goes well, the new version is gradually rolled out to more users until full deployment is achieved.
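The gradual rollout described above can be sketched as a loop over increasing traffic shares. This is only an illustration under stated assumptions: the stage percentages and the `set_traffic_share` and `is_healthy` callbacks are hypothetical names introduced here, standing in for whatever routing and monitoring tooling a real system uses.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    set_traffic_share: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> bool:
    """Raise the new version's traffic share stage by stage.

    Returns True if every stage passed its health check, False if the
    rollout was aborted and all traffic was sent back to the old version.
    """
    for percent in stages:
        set_traffic_share(percent)   # hypothetical hook into the router
        if not is_healthy():         # hypothetical canary health check
            set_traffic_share(0)     # abort: route everything to the old version
            return False
    return True


# Usage with stand-in callbacks that only record the requested shares.
history = []
ok = staged_rollout([5, 25, 50, 100], history.append, lambda: True)
```

In this sketch a failed check at any stage resets the share to 0, which is one possible policy; a real controller might instead hold at the previous stage.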
Architecture
[Diagram: Users -> Load Balancer -> Canary Group (new version, small share of traffic) and Stable Group (old version, remaining traffic)]

This diagram shows users sending requests to a load balancer that routes a small portion to the canary group running the new version, while the rest go to the stable group running the old version.
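One way a load balancer might split traffic between the two groups is weighted random selection. The sketch below assumes a hypothetical `choose_group` function and a 5% canary weight; real load balancers expose this through their own configuration rather than application code.

```python
import random


def choose_group(canary_weight: float, rng: random.Random) -> str:
    """Return "canary" with probability canary_weight, otherwise "stable"."""
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0 and 1")
    return "canary" if rng.random() < canary_weight else "stable"


# Simulate 10,000 requests with 5% of traffic weighted toward the canary.
rng = random.Random(42)  # seeded only so the simulation is repeatable
counts = {"canary": 0, "stable": 0}
for _ in range(10_000):
    counts[choose_group(0.05, rng)] += 1
```

Because each request is routed independently here, the same user can hit both versions across requests.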

Trade-offs
✓ Pros
Reduces risk by limiting exposure of new code to a small user base initially.
Allows real-time monitoring and quick rollback if issues are detected.
Enables gradual performance and stability validation in production.
✗ Cons
Requires sophisticated traffic routing and monitoring infrastructure.
Can increase operational complexity and deployment time.
May cause inconsistent user experience during rollout.
Use when deploying critical services with high user impact, and when monitoring and automated rollback are already in place. The pattern works best with at least a few thousand users, so that a small canary share still produces enough traffic to evaluate.
Avoid when the user base is very small (fewer than a few hundred users), or when deployment speed is critical and risk tolerance is high. It is also a poor fit if monitoring and rollback mechanisms are not in place.
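The inconsistent user experience listed under the cons can be reduced by assigning users to the canary deterministically, so a given user always sees the same version. The following is a minimal sketch, assuming string user IDs and a hypothetical `in_canary` helper; it is not taken from any particular routing product.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to the canary group.

    The same user_id always maps to the same bucket (0-99), so a user
    does not flip between versions from one request to the next.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent


# Users placed in the canary at 10% stay in it when the share grows to 50%.
users = [f"user-{i}" for i in range(1000)]
early = [u for u in users if in_canary(u, 10)]
still_in = all(in_canary(u, 50) for u in early)
```

Comparing a fixed bucket against a growing threshold means raising the percentage only adds users to the canary and never moves an existing canary user back.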
Real World Examples
Netflix
Netflix uses canary deployments to release new streaming service features to a small percentage of users first, ensuring stability before full rollout.
Uber
Uber deploys new versions of its ride-matching service to a subset of drivers and riders to monitor performance and prevent widespread disruption.
Amazon
Amazon uses canary deployments for its e-commerce backend services to minimize risk during frequent updates and maintain high availability.
Code Example
The before code deploys the new version to all instances simultaneously, risking full outage. The after code deploys first to 10% of instances (canary), monitors their health, and only proceeds if they are healthy. Otherwise, it rolls back the canary instances.
### Before (No Canary Deployment) ###
class ServiceDeployer:
    def __init__(self, instances):
        self.instances = instances

    def deploy(self, version):
        # Deploy new version to all instances at once
        for instance in self.instances:
            instance.update(version)

### After (With Canary Deployment) ###
class ServiceDeployer:
    def __init__(self, instances):
        self.instances = instances

    def deploy(self, version):
        # Use 10% of instances as the canary, but always at least one,
        # so a fleet of fewer than 10 instances does not skip the canary step
        canary_count = max(1, int(len(self.instances) * 0.1))
        canary_instances = self.instances[:canary_count]
        for instance in canary_instances:
            instance.update(version)
        # Monitor canary instances
        if self.monitor_canary(canary_instances):
            # Deploy to remaining instances
            for instance in self.instances[canary_count:]:
                instance.update(version)
        else:
            self.rollback(canary_instances)

    def monitor_canary(self, canary_instances):
        # Simplified monitoring logic: healthy only if there is at least
        # one canary instance and every canary instance reports healthy
        return bool(canary_instances) and all(
            instance.is_healthy() for instance in canary_instances
        )

    def rollback(self, instances):
        for instance in instances:
            instance.rollback()
Alternatives
Blue-Green Deployment
Deploys new version to a separate environment and switches all traffic at once, rather than gradual rollout.
Use when: you want instant rollback and can afford duplicate environments.
Rolling Deployment
Updates instances one by one without splitting traffic by user groups, unlike canary which targets a subset of users.
Use when: gradual instance replacement is sufficient and user segmentation is not needed.
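To make the contrast between the two alternatives concrete, here is a toy sketch of each. The `BlueGreenRouter` class, the `rolling_update` function, and the dictionary-based instances are all hypothetical stand-ins introduced for illustration, not any orchestrator's API.

```python
class BlueGreenRouter:
    """Blue-green: two full environments, all traffic switched at once."""

    def __init__(self):
        self.live = "blue"

    def switch(self):
        # Flip every request to the other environment in one step.
        self.live = "green" if self.live == "blue" else "blue"


def rolling_update(instances, version, batch_size=1):
    """Rolling: replace instances batch by batch, with no per-user split.

    Yields a snapshot of the fleet's versions after each batch.
    """
    for start in range(0, len(instances), batch_size):
        for instance in instances[start:start + batch_size]:
            instance["version"] = version
        yield [i["version"] for i in instances]


fleet = [{"version": "v1"} for _ in range(3)]
steps = list(rolling_update(fleet, "v2"))
# steps == [["v2", "v1", "v1"], ["v2", "v2", "v1"], ["v2", "v2", "v2"]]
```

Neither sketch decides which users see the new version; that targeting of a user subset is what distinguishes a canary deployment.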
Summary
Canary deployment reduces risk by releasing new versions to a small subset of users first.
It requires monitoring and rollback mechanisms to ensure stability before full rollout.
This pattern is ideal for large-scale systems where gradual validation is critical.

Practice

1. What is the main purpose of a canary deployment in microservices?
easy
A. To permanently run two versions side by side
B. To move all users to a new version at once
C. To release a new version to a small group of users first to reduce risk
D. To test the new version only in a development environment

Solution

  1. Step 1: Understand the goal of canary deployment

    Canary deployment aims to reduce risk by releasing new software versions to a small subset of users first.
  2. Step 2: Compare options with this goal

    Option C (releasing a new version to a small group of users first to reduce risk) matches this goal exactly. The other options describe different strategies or testing outside production.
  3. Final Answer:

    To release a new version to a small group of users first to reduce risk -> Option C
  4. Quick Check:

    Canary deployment = gradual rollout [OK]
Hint: Canary means small test group first, not all users [OK]
Common Mistakes:
  • Confusing canary with blue-green deployment
  • Thinking canary deploys to all users at once
  • Assuming canary is only for testing environments
2. Which of the following is the correct way to control traffic during a canary deployment?
easy
A. Send 100% of traffic to the new version immediately
B. Route a small percentage of traffic to the new version and the rest to the old
C. Stop all traffic during deployment
D. Send traffic randomly without control

Solution

  1. Step 1: Understand traffic control in canary deployment

    Traffic is gradually shifted to the new version to monitor its behavior safely.
  2. Step 2: Identify the correct traffic routing method

    Option B routes a small percentage of traffic to the new version while keeping most traffic on the old version, which is how a canary deployment controls exposure.
  3. Final Answer:

    Route a small percentage of traffic to the new version and the rest to the old -> Option B
  4. Quick Check:

    Traffic control = gradual routing [OK]
Hint: Gradually shift traffic, never 100% at once [OK]
Common Mistakes:
  • Sending all traffic immediately to new version
  • Stopping traffic completely during deployment
  • Ignoring traffic routing control
3. Consider this simplified code snippet for traffic routing in a canary deployment:
def route_request(user_id):
    if user_id % 10 == 0:
        return "new_version"
    else:
        return "old_version"

print(route_request(20))
print(route_request(23))
What will be the output?
medium
A. "new_version" followed by "old_version"
B. "new_version" followed by "new_version"
C. "old_version" followed by "old_version"
D. "old_version" followed by "new_version"

Solution

  1. Step 1: Evaluate route_request(20)

    20 % 10 equals 0, so it returns "new_version".
  2. Step 2: Evaluate route_request(23)

    23 % 10 equals 3, not 0, so it returns "old_version".
  3. Final Answer:

    "new_version" followed by "old_version" -> Option A
  4. Quick Check:

    Modulo 10 == 0 routes to new version [OK]
Hint: Check modulo condition carefully for routing [OK]
Common Mistakes:
  • Misunderstanding modulo operator
  • Assuming all users go to new version
  • Mixing output order
4. A team implemented a canary deployment but noticed that 100% of users are routed to the new version immediately. What is the most likely cause?
medium
A. Traffic routing logic sends all traffic to new version without percentage control
B. Monitoring tools are not enabled
C. Rollback was triggered accidentally
D. Old version servers are down

Solution

  1. Step 1: Analyze the symptom

    All users routed to new version immediately means no gradual traffic control.
  2. Step 2: Identify the cause

    Option A explains the symptom: the routing logic lacks percentage control, so all traffic shifts to the new version at once.
  3. Final Answer:

    Traffic routing logic sends all traffic to new version without percentage control -> Option A
  4. Quick Check:

    Immediate full traffic = missing gradual routing [OK]
Hint: Check traffic routing code for percentage control [OK]
Common Mistakes:
  • Blaming monitoring tools for routing issues
  • Assuming rollback causes full traffic shift
  • Ignoring server status impact
5. You want to design a canary deployment system that automatically rolls back if error rates exceed 5% during rollout. Which combination of components is essential?
hard
A. Load balancer, static routing, manual rollback process
B. Manual deployment script, user feedback form, database backup
C. Continuous integration server, code linter, version control
D. Traffic router, monitoring system, automated rollback controller

Solution

  1. Step 1: Identify components for traffic control and monitoring

    A traffic router directs user requests between old and new versions; monitoring system tracks error rates.
  2. Step 2: Include automated rollback for quick response

    An automated rollback controller triggers rollback if error thresholds are exceeded.
  3. Final Answer:

    Traffic router, monitoring system, automated rollback controller -> Option D
  4. Quick Check:

    Canary needs routing + monitoring + rollback [OK]
Hint: Combine routing, monitoring, and rollback for safe canary [OK]
Common Mistakes:
  • Ignoring automation in rollback
  • Confusing deployment tools with monitoring
  • Missing traffic routing component
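The automated rollback controller from question 5 can be sketched as a small decision function over canary metrics. The 5% threshold comes from the question; the `CanaryStats` type, the `decide` function, and the `min_requests` guard are assumptions added here for illustration.

```python
from dataclasses import dataclass


@dataclass
class CanaryStats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat a canary with no traffic as having a 0% error rate.
        return self.errors / self.requests if self.requests else 0.0


def decide(stats: CanaryStats, max_error_rate: float = 0.05,
           min_requests: int = 100) -> str:
    """Return "wait", "rollback", or "promote" for the current canary stats."""
    if stats.requests < min_requests:
        return "wait"        # too little traffic to judge (assumed guard)
    if stats.error_rate > max_error_rate:
        return "rollback"    # error rate exceeds the threshold
    return "promote"


decision = decide(CanaryStats(requests=1000, errors=80))  # 8% errors
```

In a full system the monitoring component would supply the stats and the traffic router would act on the returned decision.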