Platform Observability and SLAs
📖 Scenario: You work as a DevOps engineer for a machine learning platform team. Your team wants to monitor the platform's health by tracking service uptime and response times. They also want to check whether the platform meets the agreed Service Level Agreements (SLAs). The SLAs require the platform to have at least 99% uptime and an average response time below 200 milliseconds.
🎯 Goal: Build a simple Python script that stores platform metrics, sets SLA thresholds, calculates uptime and average response time, and prints whether the platform meets the SLAs.
📋 What You'll Learn
Create a dictionary with exact platform metrics data
Add SLA threshold variables for uptime and response time
Calculate uptime percentage and average response time using loops
Print the SLA compliance results exactly as specified
💡 Why This Matters
🌍 Real World
Monitoring platform health and ensuring it meets SLAs is critical for reliable machine learning services.
💼 Career
DevOps engineers and MLOps specialists use observability and SLA checks daily to maintain service quality.
1
Create platform metrics data
Create a dictionary called platform_metrics with these exact entries: 'uptime_minutes': [1440, 1430, 1420, 1440, 1435] and 'response_times_ms': [180, 210, 190, 170, 200].
Hint
Use a dictionary with two keys: 'uptime_minutes' and 'response_times_ms'. Each key should have a list of integers as values.
2
Add SLA threshold variables
Add two variables: sla_uptime_threshold set to 99.0 and sla_response_time_threshold set to 200.
Hint
Set sla_uptime_threshold to 99.0 (percent) and sla_response_time_threshold to 200 (milliseconds).
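Steps 1 and 2 together might look like the sketch below. The comments read each uptime entry as minutes of uptime in one 1440-minute day, which is an assumption inferred from the 1440 * 5 total used in step 3.

```python
# Step 1: platform metrics.
# Assumption: each entry covers one day (1440 minutes) of a five-day window.
platform_metrics = {
    'uptime_minutes': [1440, 1430, 1420, 1440, 1435],
    'response_times_ms': [180, 210, 190, 170, 200],
}

# Step 2: SLA thresholds.
sla_uptime_threshold = 99.0        # percent
sla_response_time_threshold = 200  # milliseconds
```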
3
Calculate uptime percentage and average response time
Calculate the total possible uptime minutes as 1440 * 5. Calculate the actual uptime by summing platform_metrics['uptime_minutes']. Calculate uptime_percentage as (actual uptime / total possible uptime) * 100. Calculate average_response_time as the average of platform_metrics['response_times_ms']. Use for loops with variables minute and time to sum the lists.
Hint
Use for loops to sum the uptime and response times. Then calculate percentages and averages.
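One possible way to write step 3 is sketched here. The dictionary is repeated so the snippet runs on its own, and the helper names total_possible_uptime, actual_uptime, and total_response_time are illustrative choices, not names required by the exercise.

```python
platform_metrics = {
    'uptime_minutes': [1440, 1430, 1420, 1440, 1435],
    'response_times_ms': [180, 210, 190, 170, 200],
}

# Total minutes the platform could have been up across the five entries.
total_possible_uptime = 1440 * 5

# Sum the recorded uptime with an explicit loop, as the step asks.
actual_uptime = 0
for minute in platform_metrics['uptime_minutes']:
    actual_uptime += minute

uptime_percentage = (actual_uptime / total_possible_uptime) * 100

# Sum the response times the same way, then divide by the number of samples.
total_response_time = 0
for time in platform_metrics['response_times_ms']:
    total_response_time += time

average_response_time = total_response_time / len(platform_metrics['response_times_ms'])
```

With the given data, actual_uptime is 7165 out of 7200 minutes, so uptime_percentage is roughly 99.51, and average_response_time is 950 / 5 = 190.0.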
4
Print SLA compliance results
Print two lines exactly as follows: print(f"Uptime meets SLA: {uptime_percentage >= sla_uptime_threshold}") and print(f"Response time meets SLA: {average_response_time <= sla_response_time_threshold}").
Hint
Use print statements with f-strings to show if uptime and response time meet SLA thresholds.
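As a quick check of what step 4 should print, the sketch below plugs in the values that step 3 yields for the given data (7165 of 7200 minutes up, 950 ms over 5 samples) instead of recomputing them with loops.

```python
sla_uptime_threshold = 99.0        # percent
sla_response_time_threshold = 200  # milliseconds

# Values that step 3 produces for the exercise data.
uptime_percentage = (7165 / 7200) * 100  # about 99.51
average_response_time = 950 / 5          # 190.0

print(f"Uptime meets SLA: {uptime_percentage >= sla_uptime_threshold}")
print(f"Response time meets SLA: {average_response_time <= sla_response_time_threshold}")
# Prints:
# Uptime meets SLA: True
# Response time meets SLA: True
```

Both comparisons evaluate to booleans, so the f-strings print True or False directly.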
Practice
1. What is the main purpose of platform observability in MLOps?
easy
A. To monitor and understand system performance in real time
B. To set legal contracts with users
C. To deploy machine learning models automatically
D. To store large amounts of data efficiently
Solution
Step 1: Understand observability concept
Observability means seeing how the system behaves and performs live.
Step 2: Match purpose with options
Only option A, "To monitor and understand system performance in real time", describes monitoring and understanding performance as it happens.
Final Answer:
To monitor and understand system performance in real time -> Option A
What will be the alert message if error_rate is 0.03?
medium
A. No alert
B. High error rate
C. Error rate normal
D. Syntax error
Solution
Step 1: Evaluate the condition with error_rate = 0.03
0.03 is less than 0.05, so the condition error_rate > 0.05 is false.
Step 2: Determine which alert triggers
Since condition is false, the else branch runs, triggering alert('Error rate normal').
Final Answer:
Error rate normal -> Option C
Quick Check:
0.03 < 0.05 triggers else alert [OK]
Hint: Check if error_rate exceeds threshold [OK]
Common Mistakes:
Confusing greater than with less than
Assuming no alert triggers
Thinking code has syntax error
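The code this question refers to is not reproduced in the text. A sketch consistent with the solution is shown below. It assumes a 0.05 threshold (stated in the solution), an if-branch message taken from option B, and a hypothetical alert function that stands in for whatever alerting call the original used.

```python
def alert(message):
    # Hypothetical stand-in for a real alerting call.
    print(message)
    return message

error_rate = 0.03

if error_rate > 0.05:
    result = alert('High error rate')    # assumed from option B
else:
    result = alert('Error rate normal')  # branch taken, since 0.03 is not above 0.05
```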
4. You have this SLA configuration:
sla:
  uptime: '99.95%'
  response_time_ms: 200
But your monitoring shows frequent alerts for response time exceeding 200ms. What is the most likely cause?
medium
A. The uptime percentage is incorrect
B. The SLA response_time_ms is set too low for actual system performance
C. The SLA syntax is invalid YAML
D. The monitoring tool is not running
Solution
Step 1: Analyze SLA and alert mismatch
The SLA sets response_time_ms to 200ms, but alerts show it often exceeds this.
Step 2: Identify cause of frequent alerts
This means the system often takes longer than 200 ms to respond, so either the SLA is too strict for the system's actual performance or the system needs to be made faster.
Final Answer:
The SLA response_time_ms is set too low for actual system performance -> Option B
Quick Check:
Strict SLA causes frequent alerts [OK]
Hint: Check if SLA limits match real system speed [OK]
Common Mistakes:
Blaming uptime for response time alerts
Assuming YAML syntax error without checking
Ignoring monitoring tool status
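To see why a threshold that sits below typical latency produces frequent alerts, consider the sketch below. The SLA values come from the configuration in the question; the observed response times are hypothetical sample data, and the configuration is written as a plain dictionary so no YAML parser is needed.

```python
# SLA values from the question, as a plain dictionary.
sla = {'uptime': '99.95%', 'response_time_ms': 200}

# Hypothetical observations, invented for illustration only.
observed_response_times_ms = [180, 240, 310, 195, 260]

# Every observation above the threshold would raise a response-time alert.
breaches = [t for t in observed_response_times_ms if t > sla['response_time_ms']]
breach_ratio = len(breaches) / len(observed_response_times_ms)

print(f"{len(breaches)} of {len(observed_response_times_ms)} samples exceed "
      f"{sla['response_time_ms']} ms ({breach_ratio:.0%})")
```

When most samples land above the limit, as in this made-up data, the alerts reflect a mismatch between the SLA and real performance and not a fault in the uptime figure, the YAML, or the monitoring tool.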
5. You want to combine observability metrics and SLA checks to alert only when uptime drops below 99.9% and error rate exceeds 1%. Which pseudo-code correctly implements this?
hard
A. if uptime >= 99.9 and error_rate >= 0.01:
       alert('SLA breach')
B. if uptime > 99.9 or error_rate < 0.01:
       alert('SLA breach')
C. if uptime <= 99.9 and error_rate <= 0.01:
       alert('SLA breach')
D. if uptime < 99.9 and error_rate > 0.01:
       alert('SLA breach')
Solution
Step 1: Understand SLA breach conditions
SLA breach means uptime is less than 99.9% AND error rate is greater than 1% (0.01).
Step 2: Match condition logic with options
Option D uses < for uptime and > for error rate, combined with AND, which matches the requirement exactly.
Final Answer:
if uptime < 99.9 and error_rate > 0.01: alert('SLA breach') -> Option D
Quick Check:
Use AND with correct inequalities for SLA breach [OK]
Hint: Use AND with uptime < 99.9 and error_rate > 0.01 [OK]
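The condition from option D can be wrapped in a small function and checked against a few cases. This is a sketch: check_sla and alert are hypothetical names, and the function returns a boolean so the behavior is easy to verify.

```python
def alert(message):
    # Hypothetical stand-in for a real alerting call.
    print(message)

def check_sla(uptime, error_rate):
    """Return True and alert only when both breach conditions hold."""
    if uptime < 99.9 and error_rate > 0.01:
        alert('SLA breach')
        return True
    return False

print(check_sla(99.5, 0.02))    # both conditions hold -> True
print(check_sla(99.5, 0.005))   # error rate is fine   -> False
print(check_sla(99.95, 0.02))   # uptime is fine       -> False
```

Because the comparisons are strict, an uptime of exactly 99.9 or an error rate of exactly 0.01 does not count as a breach.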