Canary releases for model updates in MLOps - Step-by-Step Execution

Process Flow - Canary releases for model updates
  1. Start with the current stable model.
  2. Deploy the new model to a small % of users.
  3. Monitor performance and errors.
  4. Good: increase the %.
  5. Full rollout.
This flow shows how a new model is released to a small group first, monitored, then either fully rolled out or rolled back based on results.
Execution Sample
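deploy_model(version='v2', traffic=10)  # send 10% of traffic to v2
metrics_good = monitor_metrics()        # True when the canary's metrics are healthy
if metrics_good:
    increase_traffic(50)                # widen the canary to 50%
else:
    rollback_to('v1')                   # return all traffic to the stable model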
This code deploys a new model to 10% of users, monitors it, then increases traffic or rolls back based on metrics.
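The sample above leaves its helper functions undefined. The following is a minimal runnable sketch, assuming an in-memory `state` dictionary in place of a real serving platform and an illustrative 5% error-rate limit. All names here are hypothetical, and `monitor_metrics` takes the observed error rate as an argument so the example can run on its own.

```python
# Hypothetical in-memory stand-in for a real serving platform's routing state.
state = {"stable": "v1", "canary": None, "canary_traffic": 0}

def deploy_model(version, traffic):
    """Register `version` as the canary and send it `traffic` percent of requests."""
    state["canary"] = version
    state["canary_traffic"] = traffic

def monitor_metrics(error_rate, max_error_rate=0.05):
    """Return True when the observed canary error rate is within the assumed limit."""
    return error_rate <= max_error_rate

def increase_traffic(percent):
    """Raise the canary's traffic share to `percent`."""
    state["canary_traffic"] = percent

def rollback_to(version):
    """Send all traffic back to `version` and drop the canary."""
    state["stable"] = version
    state["canary"] = None
    state["canary_traffic"] = 0

def canary_step(observed_error_rate):
    """Run the deploy -> monitor -> decide sequence from the sample once."""
    deploy_model(version="v2", traffic=10)
    if monitor_metrics(observed_error_rate):
        increase_traffic(50)
    else:
        rollback_to("v1")
    return dict(state)

print(canary_step(0.01))  # healthy canary: traffic raised to 50
print(canary_step(0.20))  # unhealthy canary: rolled back to v1
```

In a real system the error rate would come from a monitoring query over a fixed observation window rather than from a function argument.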
Process Table
| Step | Action | Traffic % to new model | Metrics Status | Decision | Result |
|------|--------|------------------------|----------------|----------|--------|
| 1 | Deploy new model v2 | 10% | Pending | Wait | New model serving 10% users |
| 2 | Monitor metrics | 10% | Good | Increase traffic | Prepare to increase rollout |
| 3 | Increase traffic to 50% | 50% | Pending | Wait | New model serving 50% users |
| 4 | Monitor metrics | 50% | Good | Full rollout | Prepare full rollout |
| 5 | Increase traffic to 100% | 100% | Pending | Wait | New model serving all users |
| 6 | Monitor metrics | 100% | Good | Complete | New model fully deployed |
| 7 | End process | 100% | Good | Stop | Deployment successful |
💡 Deployment ends either after a full rollout with good metrics or with a rollback if metrics turn bad at any stage.
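The stepwise rollout traced in the table above can be sketched as a loop over traffic stages. This is an illustration under stated assumptions, not a production controller: `run_canary` and its `metrics_ok` callback are hypothetical names, and the callback stands in for a real monitoring query.

```python
def run_canary(stages, metrics_ok):
    """Walk through traffic stages; roll back at the first stage whose metrics fail.

    stages:     increasing traffic percentages for the new model, e.g. [10, 50, 100]
    metrics_ok: callable taking the current percentage and returning True or False
    Returns (final_traffic_percent, deployment_state, log).
    """
    log = []
    for percent in stages:
        log.append(f"serving {percent}% on v2")
        if not metrics_ok(percent):
            log.append("metrics bad: rollback to v1")
            return 0, "stable v1", log
        log.append(f"metrics good at {percent}%")
    # Only call the rollout "full" if the last stage really was 100%.
    final_state = "full v2" if stages[-1] == 100 else "canary v2"
    return stages[-1], final_state, log

# Happy path from the table: metrics are good at every stage.
final, deployment_state, log = run_canary([10, 50, 100], lambda pct: True)
print(final, deployment_state)

# Failure at 50%: the rollout stops and all traffic returns to v1.
final, deployment_state, log = run_canary([10, 50, 100], lambda pct: pct < 50)
print(final, deployment_state)
```

Returning the log alongside the final state makes it easy to reconstruct a table like the one above for any run, including runs that end in a rollback.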
Status Tracker
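| Variable | Start | After Step 1 | After Step 3 | After Step 5 | Final |
|----------|-------|--------------|--------------|--------------|-------|
| traffic_percent | 0% | 10% | 50% | 100% | 100% |
| metrics_status | N/A | Pending | Pending | Pending | Good |
| deployment_state | stable v1 | canary v2 | canary v2 | full v2 | full v2 |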
Key Moments - 3 Insights
Why do we start with only a small percentage of traffic to the new model?
Starting small limits risk: if the new model has issues, only a few users are affected. See step 1 of the Process Table, where traffic to the new model is 10%.
What happens if the metrics are not good during monitoring?
If metrics are bad, the deployment is rolled back to the stable model to avoid impacting users. The Process Table shows only the successful path; the rollback branch applies after any monitoring step whose metrics are bad.
Why do we increase traffic gradually instead of all at once?
A gradual increase catches problems while few users are exposed and confirms stability before the full rollout. Steps 3 and 5 of the Process Table show traffic increasing stepwise.
Visual Quiz - 3 Questions
Test your understanding
Looking at the Process Table, what is the traffic percentage to the new model at step 3?
A. 50%
B. 100%
C. 10%
D. 0%
💡 Hint
Check the "Traffic % to new model" column at step 3 in the Process Table.
At which step does the new model start serving all users?
A. Step 1
B. Step 3
C. Step 5
D. Step 7
💡 Hint
Look for 100% in the "Traffic % to new model" column in the Process Table.
If metrics were bad at step 2, what would be the expected action?
A. Increase traffic to 50%
B. Roll back to the stable model
C. Continue monitoring without changes
D. Deploy another new model
💡 Hint
Refer to the Key Moments section on what happens if metrics are bad during monitoring.
Concept Snapshot
Canary releases deploy a new model to a small user group first.
Monitor performance carefully.
If good, increase traffic gradually.
If bad, roll back immediately.
This reduces risk and ensures smooth updates.
Full Transcript
Canary releases for model updates start by deploying the new model to a small percentage of users. This limits risk if the new model has issues. We then monitor key metrics such as accuracy and error rates. If metrics are good, we increase the traffic percentage step by step, watching performance at each stage. If metrics degrade at any point, we roll back to the stable model to protect users. This process continues until the new model serves all users or is rolled back. The Process Table shows each step with traffic percentages and decisions. Variables like traffic_percent and deployment_state change as the rollout progresses. Key moments include why we start small, what happens on bad metrics, and why a gradual rollout is important. The visual quiz tests understanding of these steps and decisions. This method helps update models safely and reliably.

Practice

1. What is the main purpose of a canary release when updating machine learning models?
easy
A. To train the model faster using more data
B. To immediately replace the old model with the new one for all users
C. To test the new model on a small group of users before full deployment
D. To reduce the size of the model for faster inference

Solution

  1. Step 1: Understand canary release concept

    Canary releases deploy a new model to a small subset of users first to test its performance safely.
  2. Step 2: Compare options

    Only option C, "To test the new model on a small group of users before full deployment", describes testing on a small group before full rollout, which is the main purpose.
  3. Final Answer:

    To test the new model on a small group of users before full deployment -> Option C
  4. Quick Check:

    Canary release = small group test [OK]
Hint: Canary means small test group before full rollout [OK]
Common Mistakes:
  • Thinking canary releases replace models immediately
  • Confusing canary with model training speed
  • Assuming canary reduces model size
2. Which of the following is the correct way to specify 10% traffic to a new model version in a deployment configuration?
easy
A. "traffic_split": {"new_model": 10, "old_model": 90}
B. "traffic_split": {"new_model": 0.1, "old_model": 0.9}
C. "traffic_split": {"new_model": "10%", "old_model": "90%"}
D. "traffic_split": {"new_model": 1, "old_model": 9}

Solution

  1. Step 1: Understand traffic split format

    This exercise assumes a configuration format in which traffic splits are fractions that sum to 1.0. Real platforms vary: some expect integer percentages or relative weights instead, so check your platform's documentation.
  2. Step 2: Evaluate options

    Option B, {"new_model": 0.1, "old_model": 0.9}, uses decimal fractions that sum to 1. Option A uses the integers 10 and 90 rather than fractions. Option C uses strings with percent signs, which are not numeric values in this format. Option D sums to 10, not 1.
  3. Final Answer:

    "traffic_split": {"new_model": 0.1, "old_model": 0.9} -> Option B
  4. Quick Check:

    Traffic split decimals sum to 1 [OK]
Hint: Use decimals summing to 1 for traffic percentages [OK]
Common Mistakes:
  • Using integers instead of decimals for traffic split
  • Including percent signs in values
  • Traffic splits not summing to 1
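The rules from this question can be checked mechanically. Below is a minimal sketch, assuming the fractional format used in this exercise. `validate_traffic_split` is a hypothetical helper, not part of any deployment tool.

```python
import math

def validate_traffic_split(split):
    """Check a {"model_name": fraction} mapping, as assumed in this exercise.

    Raises ValueError if a value is not a number in [0, 1] or if the
    fractions do not sum to 1 (within floating-point tolerance).
    """
    for name, share in split.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(share, bool) or not isinstance(share, (int, float)):
            raise ValueError(f"{name}: expected a number, got {share!r}")
        if not 0 <= share <= 1:
            raise ValueError(f"{name}: {share} is outside [0, 1]")
    total = sum(split.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"shares sum to {total}, expected 1.0")
    return True

# The four answer options from the question above.
options = {
    "A": {"new_model": 10, "old_model": 90},
    "B": {"new_model": 0.1, "old_model": 0.9},
    "C": {"new_model": "10%", "old_model": "90%"},
    "D": {"new_model": 1, "old_model": 9},
}
for label, split in options.items():
    try:
        validate_traffic_split(split)
        print(label, "valid")
    except ValueError as err:
        print(label, "invalid:", err)
```

Running a check like this before applying a configuration catches the three common mistakes listed above without sending any live traffic.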
3. Given this simplified code snippet for routing traffic in a canary release:
def route_request(user_id):
    if user_id % 10 == 0:
        return "new_model"
    else:
        return "old_model"

print(route_request(20))
print(route_request(23))

What will be the output?
medium
A. new_model\nold_model
B. old_model\nnew_model
C. new_model\nnew_model
D. old_model\nold_model

Solution

  1. Step 1: Analyze routing logic

    The function sends users whose user_id is divisible by 10 to the new model and all others to the old model.
  2. Step 2: Evaluate given user_ids

    For user_id 20: 20 % 10 == 0, so returns "new_model". For user_id 23: 23 % 10 == 3, so returns "old_model".
  3. Final Answer:

    new_model\nold_model -> Option A
  4. Quick Check:

    Divisible by 10 = new_model [OK]
Hint: Check modulo condition for routing [OK]
Common Mistakes:
  • Misunderstanding modulo operator
  • Swapping outputs for user IDs
  • Assuming all users get new model
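The modulo rule in this question reaches about 10% of users only if user IDs are spread evenly across their last digit. A common variation hashes the ID first, which avoids that dependence and makes the percentage adjustable. The sketch below assumes this approach; `canary_percent` is a hypothetical parameter and the function is illustrative rather than any particular platform's router.

```python
import hashlib

def route_request(user_id, canary_percent=10):
    """Deterministically route a user to "new_model" or "old_model".

    The user ID is hashed into one of 100 buckets; buckets below
    `canary_percent` go to the new model. The same user always lands in
    the same bucket, so their experience stays consistent across requests.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "new_model" if bucket < canary_percent else "old_model"

# Roughly canary_percent of a large user population reaches the new model.
share = sum(route_request(uid) == "new_model" for uid in range(10_000)) / 10_000
print(f"share routed to new_model: {share:.3f}")
```

Because routing depends only on the user ID and the percentage, raising `canary_percent` from 10 to 50 keeps every user who was already on the new model there and adds new users on top.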
4. You deployed a canary release but noticed the new model is receiving 100% of traffic instead of 10%. Which fix will correct this issue?
medium
A. Change traffic split from {"new_model": 1, "old_model": 0} to {"new_model": 0.1, "old_model": 0.9}
B. Increase the new model traffic to 50% to balance load
C. Restart the deployment without changing traffic split
D. Remove the old model from deployment

Solution

  1. Step 1: Identify traffic split error

    The current split {"new_model": 1, "old_model": 0} sends all traffic to the new model, which explains the 100% figure.
  2. Step 2: Correct traffic split values

    Setting split to {"new_model": 0.1, "old_model": 0.9} correctly routes 10% traffic to new model and 90% to old model.
  3. Final Answer:

    Change traffic split from {"new_model": 1, "old_model": 0} to {"new_model": 0.1, "old_model": 0.9} -> Option A
  4. Quick Check:

    Traffic split controls user percentage [OK]
Hint: Check traffic split decimals sum to 1 [OK]
Common Mistakes:
  • Restarting without fixing traffic split
  • Increasing new model traffic without reason
  • Removing old model prematurely
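A short simulation makes the effect of the two splits visible. This sketch assumes a router that picks a model at random in proportion to its configured share. `pick_model` and `observed_share` are hypothetical helpers, and the fixed seed only makes the run repeatable.

```python
import random

def pick_model(traffic_split, rng):
    """Choose a model name at random, weighted by its share of traffic."""
    names = list(traffic_split)
    weights = [traffic_split[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

def observed_share(traffic_split, requests=10_000, seed=0):
    """Simulate `requests` routing decisions and return the new model's share."""
    rng = random.Random(seed)
    hits = sum(pick_model(traffic_split, rng) == "new_model" for _ in range(requests))
    return hits / requests

print(observed_share({"new_model": 1, "old_model": 0}))      # misconfigured split
print(observed_share({"new_model": 0.1, "old_model": 0.9}))  # intended split
```

With the misconfigured split the old model has weight 0, so every request goes to the new model. With the corrected split the observed share settles near 0.1.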
5. You want to safely update a model with a canary release. The new model shows better accuracy but higher latency. What is the best approach to decide whether to proceed with full rollout?
hard
A. Deploy new model only to internal users without monitoring
B. Ignore latency since accuracy is more important; rollout immediately
C. Increase traffic to new model to 100% to gather more data quickly
D. Monitor both accuracy and latency metrics during canary; rollback if latency impact is unacceptable

Solution

  1. Step 1: Understand trade-offs in canary release

    A canary release evaluates the new model on every metric that affects user experience, including both accuracy and latency.
  2. Step 2: Choose monitoring and rollback strategy

    Monitoring both metrics allows an informed decision: roll back if the added latency harms the user experience despite the accuracy gains.
  3. Final Answer:

    Monitor both accuracy and latency metrics during canary; rollback if latency impact is unacceptable -> Option D
  4. Quick Check:

    Balance metrics and rollback if needed [OK]
Hint: Watch all key metrics before full rollout [OK]
Common Mistakes:
  • Ignoring latency impact
  • Rushing full rollout without monitoring
  • Skipping rollback plans
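The decision in this question can be written as an explicit gate that compares the canary to the stable baseline on both metrics. The sketch below is one possible shape for such a gate. The `Metrics` class, the function name, and the threshold values are illustrative assumptions; real thresholds depend on the product's latency budget and on how much accuracy gain is meaningful.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    accuracy: float        # fraction of correct predictions, 0..1
    p95_latency_ms: float  # 95th-percentile response time in milliseconds

def canary_decision(baseline, canary, min_accuracy_gain=0.0, max_latency_increase_ms=50.0):
    """Return "promote" or "rollback" by comparing canary metrics to the baseline."""
    accuracy_gain = canary.accuracy - baseline.accuracy
    latency_increase = canary.p95_latency_ms - baseline.p95_latency_ms
    if accuracy_gain < min_accuracy_gain:
        return "rollback"  # accuracy gain is below the required minimum
    if latency_increase > max_latency_increase_ms:
        return "rollback"  # slowdown exceeds the allowed latency budget
    return "promote"

baseline = Metrics(accuracy=0.90, p95_latency_ms=120.0)
print(canary_decision(baseline, Metrics(accuracy=0.93, p95_latency_ms=150.0)))  # within budget
print(canary_decision(baseline, Metrics(accuracy=0.93, p95_latency_ms=400.0)))  # too slow
```

Writing the thresholds down before the rollout starts keeps the promote-or-rollback decision from being argued after the fact.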