
Canary deployment in Microservices - System Design Exercise

Design: Canary Deployment System for Microservices
This design focuses on the deployment and traffic-routing system for canary releases in a microservices architecture. It excludes detailed CI/CD pipeline internals and microservice business logic.
Functional Requirements
FR1: Deploy new versions of microservices gradually to a small subset of users before full rollout
FR2: Monitor key metrics (errors, latency, user feedback) during canary phase
FR3: Automatically rollback if metrics degrade beyond threshold
FR4: Support routing a configurable percentage of traffic to canary version
FR5: Allow manual promotion or rollback of canary to production
FR6: Integrate with existing CI/CD pipelines
FR7: Provide visibility and logs for deployment status
Non-Functional Requirements
NFR1: Handle up to 100,000 concurrent users during deployment
NFR2: API response latency p99 under 300ms during deployment
NFR3: Availability target 99.9% uptime during deployment
NFR4: Support zero downtime deployment
NFR5: Secure routing and access control for deployment management
Think Before You Design
Key Components
API Gateway or Service Mesh for traffic routing
Monitoring and alerting system
Deployment orchestrator or CI/CD integration
Configuration management for routing rules
Logging and audit trail system
Design Patterns
Blue-Green Deployment
Feature Flags
Circuit Breaker for fault tolerance
Health Checks and Metrics Collection
Progressive Delivery
Reference Architecture
                +---------------------+
                |  Deployment Manager |
                +----------+----------+
                           |
                           | Deployment commands
                           v
+----------------+    +----------------+    +----------------+
|  API Gateway / |<-->|  Service Mesh  |<-->|  Microservices |
|  Load Balancer |    +----------------+    +----------------+
+-------+--------+           |  ^
        |                    |  |
        | Traffic Routing    |  | Metrics Collection
        v                    |  |
+-----------------+          |  |
| Monitoring &    |<---------+  |
| Alerting System |             |
+-----------------+             |
                                |
                      +---------------------+
                      | Configuration Store |
                      +---------------------+
Components
Deployment Manager
Kubernetes, Jenkins, or custom orchestrator
Controls deployment lifecycle, triggers canary rollout, rollback, and promotion
API Gateway / Load Balancer
NGINX, Envoy, or cloud load balancer
Routes incoming user traffic between stable and canary versions based on configured percentages
Service Mesh
Istio, Linkerd
Manages fine-grained traffic routing, telemetry, and fault injection for microservices
Monitoring & Alerting System
Prometheus, Grafana, Alertmanager
Collects metrics like error rates, latency; triggers alerts on anomalies
Configuration Store
Consul, Etcd, or ConfigMaps
Stores routing rules and deployment states for dynamic updates
Microservices
Docker containers, Kubernetes pods
Business logic components deployed in stable and canary versions
Request Flow
1. Deployment Manager triggers a new canary version deployment of a microservice.
2. The new version is deployed alongside the stable version in the cluster.
3. Configuration Store updates routing rules to send a small percentage (e.g., 5%) of traffic to the canary version.
4. API Gateway or Service Mesh routes user requests according to the updated rules.
5. Monitoring system collects metrics (latency, errors) from both versions.
6. If metrics are within acceptable thresholds, Deployment Manager gradually increases traffic to the canary.
7. If metrics degrade, Deployment Manager triggers automatic rollback to the stable version.
8. Once the canary is validated, Deployment Manager promotes it to stable and updates routing to 100%.
9. Logs and deployment status are recorded for audit and visibility.
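The flow above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not a real orchestrator: `run_canary`, `set_traffic`, `get_error_rate`, the step schedule, and the 5% error threshold are all hypothetical and do not come from any specific tool.

```python
from typing import Callable, Sequence


def run_canary(
    set_traffic: Callable[[int], None],
    get_error_rate: Callable[[], float],
    steps: Sequence[int] = (5, 25, 50),
    max_error_rate: float = 0.05,
) -> str:
    """Shift traffic to the canary step by step; roll back on bad metrics.

    set_traffic(p) is assumed to send p percent of traffic to the canary.
    get_error_rate() is assumed to return the canary's error rate (0.0-1.0).
    A real system would also wait for an observation window at each step.
    """
    for percent in steps:
        set_traffic(percent)                   # steps 3-4: update routing
        if get_error_rate() > max_error_rate:  # step 5: check metrics
            set_traffic(0)                     # step 7: automatic rollback
            return "rolled_back"
    set_traffic(100)                           # step 8: promote the canary
    return "promoted"


# Usage with in-memory stand-ins for the router and the monitoring system.
history = []
print(run_canary(history.append, lambda: 0.01), history)
```

With a healthy canary the stand-in router records the traffic steps 5, 25, 50, and 100 and the function returns "promoted". If the error rate exceeds the threshold at any step, traffic drops back to 0 and the function returns "rolled_back".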
Database Schema
Entities:
- Deployment: id, microservice_name, version, status (canary, stable, rolled_back), start_time, end_time
- RoutingRule: id, microservice_name, version, traffic_percentage
- Metrics: id, deployment_id, timestamp, error_rate, latency_ms, user_feedback_score
- Alert: id, deployment_id, metric_type, threshold, triggered_time, resolved_time
Relationships:
- Deployment 1:N Metrics
- Deployment 1:N Alerts
- Deployment 1:1 RoutingRule (current active routing)
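One way to picture these entities is as plain Python dataclasses. The field names come from the schema above; the field types, the optional fields, and the sample service name "checkout" are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class DeploymentStatus(Enum):
    CANARY = "canary"
    STABLE = "stable"
    ROLLED_BACK = "rolled_back"


@dataclass
class RoutingRule:
    id: int
    microservice_name: str
    version: str
    traffic_percentage: int  # assumed to be an integer from 0 to 100


@dataclass
class Metrics:
    id: int
    deployment_id: int
    timestamp: datetime
    error_rate: float
    latency_ms: float
    user_feedback_score: float


@dataclass
class Alert:
    id: int
    deployment_id: int
    metric_type: str
    threshold: float
    triggered_time: datetime
    resolved_time: Optional[datetime] = None  # unset while the alert is open


@dataclass
class Deployment:
    id: int
    microservice_name: str
    version: str
    status: DeploymentStatus
    start_time: datetime
    end_time: Optional[datetime] = None
    routing_rule: Optional[RoutingRule] = None            # 1:1 active routing
    metrics: List[Metrics] = field(default_factory=list)  # 1:N
    alerts: List[Alert] = field(default_factory=list)     # 1:N


# Usage: a canary deployment of a hypothetical "checkout" service at 5%.
deployment = Deployment(1, "checkout", "v2", DeploymentStatus.CANARY,
                        datetime(2024, 1, 1))
deployment.routing_rule = RoutingRule(1, "checkout", "v2", 5)
print(deployment.status.value, deployment.routing_rule.traffic_percentage)
```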
Scaling Discussion
Bottlenecks
API Gateway or Service Mesh becomes a traffic routing bottleneck under high load
Monitoring system overwhelmed by high volume of metrics data
Configuration Store latency impacts routing updates speed
Deployment Manager delays due to complex rollback or promotion logic
Solutions
Scale API Gateway horizontally with load balancers; use efficient routing algorithms
Use sampling and aggregation in monitoring to reduce data volume; employ scalable TSDBs
Use distributed, highly available configuration stores with caching
Optimize Deployment Manager logic; use asynchronous workflows and parallel checks
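As an illustration of the caching idea for the configuration store, the sketch below keeps routing rules in memory for a short time so that requests do not each trigger a store read. `CachedRoutingRules`, the `fetch` callable, and the 5-second TTL are hypothetical; a real gateway or mesh would normally rely on its own watch or push mechanism.

```python
import time
from typing import Callable, Dict, Optional


class CachedRoutingRules:
    """Cache routing rules locally and refresh them after a TTL expires."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, int]],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch    # hypothetical call into the configuration store
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so the cache can be tested
        self._rules: Optional[Dict[str, int]] = None
        self._expires_at = 0.0

    def get(self) -> Dict[str, int]:
        now = self._clock()
        if self._rules is None or now >= self._expires_at:
            self._rules = self._fetch()        # cache miss: read the store
            self._expires_at = now + self._ttl
        return self._rules


# Usage: two lookups inside the TTL cause only one read of the store.
reads = []


def fetch_rules() -> Dict[str, int]:
    reads.append(1)
    return {"stable": 95, "canary": 5}


rules = CachedRoutingRules(fetch_rules, ttl_seconds=5.0)
rules.get()
rules.get()
print(len(reads))
```

The trade-off is staleness: a routing change, including a rollback, can take up to one TTL to reach every router, so the TTL should stay small compared with the time allowed for a rollback.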
Interview Tips
Time: Spend 10 minutes understanding requirements and asking clarifying questions, 20 minutes designing the architecture and data flow, 10 minutes discussing scaling and trade-offs, and 5 minutes summarizing.
Explain gradual traffic shifting and importance of monitoring during canary
Discuss integration with service mesh or API gateway for routing
Highlight rollback automation and zero downtime deployment
Mention metrics to monitor and alerting strategies
Address scaling challenges and solutions
Show awareness of operational visibility and audit trails

Practice

1. What is the main purpose of a canary deployment in microservices?
easy
A. To permanently run two versions side by side
B. To deploy all users to a new version at once
C. To release a new version to a small group of users first to reduce risk
D. To test the new version only in a development environment

Solution

  1. Step 1: Understand the goal of canary deployment

    Canary deployment aims to reduce risk by releasing new software versions to a small subset of users first.
  2. Step 2: Compare options with this goal

    Option C matches this goal exactly. The other options describe permanent parallel versions, an all-at-once release, or testing confined to a development environment.
  3. Final Answer:

    To release a new version to a small group of users first to reduce risk -> Option C
  4. Quick Check:

    Canary deployment = gradual rollout [OK]
Hint: Canary means small test group first, not all users [OK]
Common Mistakes:
  • Confusing canary with blue-green deployment
  • Thinking canary deploys to all users at once
  • Assuming canary is only for testing environments
2. Which of the following is the correct way to control traffic during a canary deployment?
easy
A. Send 100% of traffic to the new version immediately
B. Route a small percentage of traffic to the new version and the rest to the old
C. Stop all traffic during deployment
D. Send traffic randomly without control

Solution

  1. Step 1: Understand traffic control in canary deployment

    Traffic is gradually shifted to the new version to monitor its behavior safely.
  2. Step 2: Identify the correct traffic routing method

    Option B routes a small percentage of traffic to the new version and keeps the rest on the old version, which is how a canary limits exposure.
  3. Final Answer:

    Route a small percentage of traffic to the new version and the rest to the old -> Option B
  4. Quick Check:

    Traffic control = gradual routing [OK]
Hint: Gradually shift traffic, never 100% at once [OK]
Common Mistakes:
  • Sending all traffic immediately to new version
  • Stopping traffic completely during deployment
  • Ignoring traffic routing control
3. Consider this simplified code snippet for traffic routing in a canary deployment:
def route_request(user_id):
    if user_id % 10 == 0:
        return "new_version"
    else:
        return "old_version"

print(route_request(20))
print(route_request(23))
What will be the output?
medium
A. "new_version" followed by "old_version"
B. "new_version" followed by "new_version"
C. "old_version" followed by "old_version"
D. "old_version" followed by "new_version"

Solution

  1. Step 1: Evaluate route_request(20)

    20 % 10 equals 0, so it returns "new_version".
  2. Step 2: Evaluate route_request(23)

    23 % 10 equals 3, not 0, so it returns "old_version".
  3. Final Answer:

    "new_version" followed by "old_version" -> Option A
  4. Quick Check:

    Modulo 10 == 0 routes to new version [OK]
Hint: Check modulo condition carefully for routing [OK]
Common Mistakes:
  • Misunderstanding modulo operator
  • Assuming all users go to new version
  • Mixing output order
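The hard-coded `% 10 == 0` check in the question above fixes the canary share at 10%. A possible generalization with a configurable percentage is sketched below; the `canary_percent` parameter and the 100-bucket scheme are assumptions, and the sketch selects a different set of users than the original snippet does (user 20 now stays on the old version).

```python
def route_request(user_id: int, canary_percent: int = 10) -> str:
    """Route roughly canary_percent of user IDs to the new version.

    Each user always falls into the same bucket, so routing is sticky.
    Assumes user IDs are spread evenly; real systems often hash the ID first.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    bucket = user_id % 100  # bucket in the range 0..99
    return "new_version" if bucket < canary_percent else "old_version"


print(route_request(20))                      # bucket 20 -> old_version
print(route_request(105))                     # bucket 5  -> new_version
print(route_request(20, canary_percent=25))   # bucket 20 -> new_version
```

Raising `canary_percent` only adds users to the canary group; nobody who already saw the new version is moved back until a rollback sets the value to 0.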
4. A team implemented a canary deployment but noticed that 100% of users are routed to the new version immediately. What is the most likely cause?
medium
A. Traffic routing logic sends all traffic to new version without percentage control
B. Monitoring tools are not enabled
C. Rollback was triggered accidentally
D. Old version servers are down

Solution

  1. Step 1: Analyze the symptom

    All users routed to new version immediately means no gradual traffic control.
  2. Step 2: Identify the cause

    Option A explains the symptom: the routing logic has no percentage control, so all traffic shifts to the new version at once.
  3. Final Answer:

    Traffic routing logic sends all traffic to new version without percentage control -> Option A
  4. Quick Check:

    Immediate full traffic = missing gradual routing [OK]
Hint: Check traffic routing code for percentage control [OK]
Common Mistakes:
  • Blaming monitoring tools for routing issues
  • Assuming rollback causes full traffic shift
  • Ignoring server status impact
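A quick way to catch this kind of bug before rollout is to measure what share of a sample of users a router actually sends to the new version. The sketch below is illustrative only; `canary_share`, `broken_router`, and `canary_router` are hypothetical names.

```python
from typing import Callable, Iterable


def canary_share(router: Callable[[int], str], user_ids: Iterable[int]) -> float:
    """Return the fraction of user_ids that the router sends to the new version."""
    ids = list(user_ids)
    if not ids:
        raise ValueError("user_ids must not be empty")
    hits = sum(1 for uid in ids if router(uid) == "new_version")
    return hits / len(ids)


def broken_router(user_id: int) -> str:
    # Bug: no percentage check, so every user gets the new version.
    return "new_version"


def canary_router(user_id: int) -> str:
    # One user in ten goes to the new version.
    return "new_version" if user_id % 10 == 0 else "old_version"


print(canary_share(broken_router, range(1000)))  # 1.0
print(canary_share(canary_router, range(1000)))  # 0.1
```

A share of 1.0 when 0.1 was intended points directly at missing percentage control in the routing logic.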
5. You want to design a canary deployment system that automatically rolls back if error rates exceed 5% during rollout. Which combination of components is essential?
hard
A. Load balancer, static routing, manual rollback process
B. Manual deployment script, user feedback form, database backup
C. Continuous integration server, code linter, version control
D. Traffic router, monitoring system, automated rollback controller

Solution

  1. Step 1: Identify components for traffic control and monitoring

    A traffic router directs user requests between the old and new versions; a monitoring system tracks error rates.
  2. Step 2: Include automated rollback for quick response

    An automated rollback controller triggers rollback if error thresholds are exceeded.
  3. Final Answer:

    Traffic router, monitoring system, automated rollback controller -> Option D
  4. Quick Check:

    Canary needs routing + monitoring + rollback [OK]
Hint: Combine routing, monitoring, and rollback for safe canary [OK]
Common Mistakes:
  • Ignoring automation in rollback
  • Confusing deployment tools with monitoring
  • Missing traffic routing component
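The decision made by the automated rollback controller can be sketched as a single check against the 5% threshold from the question. `WindowStats`, `should_rollback`, and the `min_requests` guard are assumptions added for illustration, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for the canary over one observation window."""
    total_requests: int
    failed_requests: int


def should_rollback(
    stats: WindowStats,
    max_error_rate: float = 0.05,
    min_requests: int = 100,
) -> bool:
    """Return True when the canary's error rate exceeds the threshold.

    min_requests is an assumed guard: with very little traffic, a single
    failure would otherwise be enough to trigger a rollback.
    """
    if stats.total_requests < min_requests:
        return False  # not enough data to decide yet
    error_rate = stats.failed_requests / stats.total_requests
    return error_rate > max_error_rate


print(should_rollback(WindowStats(1000, 30)))  # False: 3% error rate
print(should_rollback(WindowStats(1000, 80)))  # True: 8% error rate
print(should_rollback(WindowStats(10, 5)))     # False: too few requests
```

In a full system the monitoring component would supply the counts and the traffic router would apply the rollback, which is why all three components in Option D are needed.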