Alert thresholds and policies in MLOps - Commands & Configuration

Introduction
Alert thresholds and policies help you get notified when something important happens in your machine learning system. They watch key numbers and send alerts if those numbers go too high or too low, so you can fix problems quickly.
Typical situations where they help:

  • When you want to know if your model's accuracy drops below a certain level after deployment
  • When you need to be alerted if the data input to your model changes unexpectedly
  • When you want to monitor resource usage like CPU or memory during model training and get notified if it exceeds limits
  • When you want to track if your model's prediction latency becomes too slow
  • When you want to automate responses to certain conditions by linking alerts to actions
Config File - alert_policy.yaml
alert_policies:
  - name: "Model Accuracy Drop"
    metric: "model_accuracy"
    threshold: 0.85
    comparison: "less_than"
    severity: "critical"
    notification_channels:
      - "email"
      - "slack"
  - name: "Data Drift Detected"
    metric: "input_data_drift"
    threshold: 0.1
    comparison: "greater_than"
    severity: "warning"
    notification_channels:
      - "email"
  - name: "High CPU Usage"
    metric: "cpu_usage"
    threshold: 80
    comparison: "greater_than"
    severity: "critical"
    notification_channels:
      - "pagerduty"

This YAML file defines alert policies for monitoring your ML system.

  • name: The alert's name for easy identification.
  • metric: The metric to watch, like model accuracy or CPU usage.
  • threshold: The value that triggers the alert.
  • comparison: How to compare the metric to the threshold (less_than or greater_than).
  • severity: How serious the alert is (warning or critical).
  • notification_channels: Where to send alerts, such as email, Slack, or PagerDuty.
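The field descriptions above can be turned into a small pre-flight check. The following is a minimal sketch, not part of any real tool: the `validate_policy` helper and its error messages are hypothetical, and the allowed values are taken only from the fields listed above. The policy is written as a Python dict so the example needs no YAML library.

```python
# Hypothetical validator for one entry of alert_policies (names are assumptions).
ALLOWED_COMPARISONS = {"less_than", "greater_than"}
ALLOWED_SEVERITIES = {"warning", "critical"}
REQUIRED_FIELDS = ("name", "metric", "threshold", "comparison",
                   "severity", "notification_channels")


def validate_policy(policy: dict) -> list[str]:
    """Return a list of problems found; an empty list means the policy looks valid."""
    problems = [f"missing field: {field}"
                for field in REQUIRED_FIELDS if field not in policy]

    comparison = policy.get("comparison")
    if comparison is not None and comparison not in ALLOWED_COMPARISONS:
        problems.append(f"unknown comparison: {comparison}")

    severity = policy.get("severity")
    if severity is not None and severity not in ALLOWED_SEVERITIES:
        problems.append(f"unknown severity: {severity}")

    threshold = policy.get("threshold")
    # bool is a subclass of int in Python, so reject it explicitly.
    if threshold is not None and (isinstance(threshold, bool)
                                  or not isinstance(threshold, (int, float))):
        problems.append("threshold must be a number")

    channels = policy.get("notification_channels")
    if channels is not None and not channels:
        problems.append("notification_channels must not be empty")

    return problems


accuracy_policy = {
    "name": "Model Accuracy Drop",
    "metric": "model_accuracy",
    "threshold": 0.85,
    "comparison": "less_than",
    "severity": "critical",
    "notification_channels": ["email", "slack"],
}
print(validate_policy(accuracy_policy))  # → []
```

Running a check like this before creating policies catches typos such as a misspelled comparison or an empty channel list.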
Commands
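This command creates alert policies from the configuration file, setting up the system to watch the specified metrics and send notifications when thresholds are crossed. Note that the standard MLflow CLI does not include an alerts subcommand, so treat the mlflow alerts commands in this section as illustrative and substitute the equivalent commands from your monitoring tool.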
Terminal
mlflow alerts create --file alert_policy.yaml
Expected Output
Alert policies created successfully: Model Accuracy Drop, Data Drift Detected, High CPU Usage
--file - Specifies the alert policy configuration file to use
This command lists all active alert policies so you can verify they were created correctly.
Terminal
mlflow alerts list
Expected Output
ID  Name                 Metric            Threshold  Comparison    Severity
1   Model Accuracy Drop  model_accuracy    0.85       less_than     critical
2   Data Drift Detected  input_data_drift  0.1        greater_than  warning
3   High CPU Usage       cpu_usage         80         greater_than  critical
This command tests the alert policy named 'Model Accuracy Drop' by simulating a metric value of 0.80, which is below the threshold, to check if the alert triggers correctly.
Terminal
mlflow alerts test --name "Model Accuracy Drop" --metric-value 0.80
Expected Output
Alert triggered: Model Accuracy Drop (model_accuracy = 0.80 < 0.85)
Severity: critical
Notifications sent to: email, slack
--name - Specifies which alert policy to test
--metric-value - Simulates the metric value for testing the alert
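The comparison that such a test performs can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: `should_trigger` and `COMPARISONS` are hypothetical names, and the sketch assumes strict comparisons, so a value exactly equal to the threshold does not trigger.

```python
import operator

# Maps the comparison names used in alert_policy.yaml to Python operators.
COMPARISONS = {"less_than": operator.lt, "greater_than": operator.gt}


def should_trigger(policy: dict, metric_value: float) -> bool:
    """Return True when metric_value crosses the policy's threshold."""
    compare = COMPARISONS[policy["comparison"]]
    return compare(metric_value, policy["threshold"])


policy = {
    "name": "Model Accuracy Drop",
    "metric": "model_accuracy",
    "threshold": 0.85,
    "comparison": "less_than",
}

print(should_trigger(policy, 0.80))  # True: 0.80 is below 0.85
print(should_trigger(policy, 0.85))  # False: equal to the threshold is not below it
print(should_trigger(policy, 0.90))  # False: accuracy is above the threshold
```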
Key Concept

If you remember nothing else from this pattern, remember: alert thresholds watch important metrics and notify you immediately when values cross set limits.

Common Mistakes
Setting thresholds too tight or too loose without testing
This causes too many false alerts or missed important issues, making alerts useless or annoying.
Test alert policies with realistic metric values and adjust thresholds to balance sensitivity and noise.
Not specifying notification channels correctly
Alerts won't reach the right people or systems, so problems go unnoticed.
Always include valid notification channels like email or Slack in the alert policy.
Forgetting to list or verify alert policies after creation
You might think alerts are active when they are not, missing critical notifications.
Use the list command to confirm alert policies are created and active.
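For the first mistake, one way to test a threshold before deploying it is to replay past metric values against a few candidates and count how often each would have fired. The sketch below assumes a hypothetical `count_alerts` helper, and the accuracy readings are made up for illustration only.

```python
def count_alerts(values, threshold, comparison):
    """Count how many readings would have triggered an alert."""
    if comparison == "less_than":
        return sum(1 for v in values if v < threshold)
    if comparison == "greater_than":
        return sum(1 for v in values if v > threshold)
    raise ValueError(f"unknown comparison: {comparison}")


# Made-up daily accuracy readings, for illustration only.
daily_accuracy = [0.91, 0.89, 0.90, 0.84, 0.88, 0.92, 0.79, 0.90]

for candidate in (0.80, 0.85, 0.90):
    fired = count_alerts(daily_accuracy, candidate, "less_than")
    print(f"threshold {candidate:.2f}: {fired} of {len(daily_accuracy)} readings would alert")
```

On this sample, a threshold of 0.80 fires once, 0.85 fires twice, and 0.90 fires four times, which shows how quickly a tighter threshold adds noise.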
Summary
Create alert policies using a YAML file that defines metrics, thresholds, and notification channels.
Use CLI commands to create, list, and test alert policies to ensure they work as expected.
Alert thresholds help catch problems early by notifying you when key metrics cross set limits.

Practice

1. What is the main purpose of setting an alert threshold in MLOps monitoring?
easy
A. To group multiple alerts into a single notification
B. To specify when a warning or alert should be triggered based on metric values
C. To define the actions taken after an alert is triggered
D. To store historical data of model performance

Solution

  1. Step 1: Understand alert threshold concept

    An alert threshold sets a limit on a metric value that, when crossed, triggers an alert.
  2. Step 2: Differentiate from policies and actions

    Policies group conditions and actions, but thresholds specifically define when alerts fire.
  3. Final Answer:

    To specify when a warning or alert should be triggered based on metric values -> Option B
  4. Quick Check:

    Alert threshold = trigger point [OK]
Hint: Thresholds set alert trigger points based on metrics
Common Mistakes:
  • Confusing thresholds with alert grouping
  • Thinking thresholds define actions
  • Assuming thresholds store data
2. Which of the following is the correct way to define an alert threshold for CPU usage exceeding 80% in a YAML policy?
easy
A. threshold: { metric: 'cpu_usage', operator: '>', value: 80 }
B. threshold: { metric: 'cpu_usage', operator: '<', value: 80 }
C. threshold: { metric: 'cpu_usage', operator: '=', value: 80 }
D. threshold: { metric: 'cpu_usage', operator: '!=', value: 80 }

Solution

  1. Step 1: Identify the correct operator for exceeding 80%

    Exceeding means greater than, so operator should be '>'.
  2. Step 2: Match metric and value correctly

    Metric is 'cpu_usage' and value is 80, so the syntax matches threshold: { metric: 'cpu_usage', operator: '>', value: 80 }.
  3. Final Answer:

    threshold: { metric: 'cpu_usage', operator: '>', value: 80 } -> Option A
  4. Quick Check:

    Exceeding 80% means operator '>' [OK]
Hint: Use '>' operator for thresholds exceeding a value
Common Mistakes:
  • Using '<' instead of '>' for exceeding
  • Using '=' which triggers only at exact value
  • Using '!=' which triggers for all except exact
3. Given this alert policy snippet:
thresholds:
  - metric: 'latency'
    operator: '>'
    value: 200
actions:
  - notify: 'on-call-team'

What happens when latency reaches 250?
medium
A. The alert triggers but no notification is sent
B. No alert is triggered because 250 is less than 200
C. An alert is triggered and the on-call team is notified
D. The system ignores latency metric

Solution

  1. Step 1: Analyze threshold condition

    The threshold triggers when latency > 200. Since 250 > 200, condition is met.
  2. Step 2: Check actions on trigger

    Action is to notify 'on-call-team', so notification will be sent.
  3. Final Answer:

    An alert is triggered and the on-call team is notified -> Option C
  4. Quick Check:

    Latency 250 > 200 triggers alert and notify [OK]
Hint: Check if metric value crosses threshold to trigger alerts
Common Mistakes:
  • Misreading operator direction
  • Ignoring actions linked to alerts
  • Assuming no notification without explicit command
4. You have this alert policy configuration:
thresholds:
  - metric: 'error_rate'
    operator: '>'
    value: 5
actions:
  - notify: 'dev-team'

But alerts never trigger even when error_rate is 10. What is the likely issue?
medium
A. The operator should be '<' instead of '>'
B. Notifications require a separate enable flag
C. The value 5 is too high to trigger alerts
D. The metric name might be misspelled or mismatched

Solution

  1. Step 1: Verify operator and value logic

    Operator '>' with value 5 means alert triggers if error_rate > 5, so 10 should trigger alert.
  2. Step 2: Check metric name correctness

    If alerts never trigger, a common cause is metric name mismatch or typo causing no data match.
  3. Final Answer:

    The metric name might be misspelled or mismatched -> Option D
  4. Quick Check:

    Metric name mismatch blocks alert triggers [OK]
Hint: Check metric names carefully if alerts don't trigger
Common Mistakes:
  • Changing operator incorrectly
  • Assuming threshold value is too high
  • Assuming notifications need a separate enable flag
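A metric name mismatch like the one in question 4 can be caught by comparing the names used in a policy with the names the system actually reports. This is a minimal sketch: `find_unknown_metrics` is a hypothetical helper, and the set of reported metric names is an assumption chosen to show a hyphen/underscore mismatch.

```python
def find_unknown_metrics(policy: dict, reported_metrics: set[str]) -> list[str]:
    """Return metric names used in thresholds that the system never reports."""
    return [t["metric"] for t in policy["thresholds"]
            if t["metric"] not in reported_metrics]


policy = {
    "thresholds": [{"metric": "error_rate", "operator": ">", "value": 5}],
    "actions": [{"notify": "dev-team"}],
}

# Hypothetical: the service emits the metric with a hyphen, not an underscore.
reported = {"error-rate", "latency"}

print(find_unknown_metrics(policy, reported))  # → ['error_rate']
```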
5. You want to create a policy that triggers an alert if either model accuracy drops below 90% or latency exceeds 300ms. Which configuration correctly defines this combined alert policy?
hard
A. { thresholds: [ { metric: 'accuracy', operator: '<', value: 90 }, { metric: 'latency', operator: '>', value: 300 } ], actions: [ { notify: 'ml-team' } ] }
B. { thresholds: [ { metric: 'accuracy', operator: '>', value: 90 }, { metric: 'latency', operator: '<', value: 300 } ], actions: [ { notify: 'ml-team' } ] }
C. { thresholds: [ { metric: 'accuracy', operator: '<', value: 90, condition: 'AND' }, { metric: 'latency', operator: '>', value: 300 } ], actions: [ { notify: 'ml-team' } ] }
D. { thresholds: [ { metric: 'accuracy', operator: '<', value: 90, condition: 'OR' }, { metric: 'latency', operator: '>', value: 300 } ], actions: [ { notify: 'ml-team' } ] }

Solution

  1. Step 1: Identify correct operators for conditions

    Accuracy below 90% means operator '<', latency exceeding 300 means operator '>'.
  2. Step 2: Understand default logical grouping

    Alert systems differ in how they combine multiple thresholds. This exercise assumes that listed thresholds are combined with OR, so the alert fires if either condition is met; check your monitoring tool's documentation for its actual behavior.
  3. Step 3: Verify options for logical conditions

    Options C and D add a condition key ('AND' or 'OR') under a threshold, which this policy format does not support. Option B reverses both operators, using '>' for accuracy and '<' for latency.
  4. Final Answer:

    List both thresholds ('<' 90 for accuracy, '>' 300 for latency) with no condition key -> Option A
  5. Quick Check:

    Correct operators + default OR logic = Option A [OK]
Hint: Use correct operators and list thresholds for OR logic
Common Mistakes:
  • Using wrong operators for conditions
  • Adding unsupported 'condition' keys
  • Assuming AND logic without explicit config
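The OR behavior assumed in question 5 can be sketched as follows. This is an illustration, not a real tool's evaluator: `policy_fires` and `OPERATORS` are hypothetical names, and combining thresholds with OR is the quiz's assumption rather than a universal rule.

```python
import operator

# Maps the operator symbols used in the quiz snippets to Python operators.
OPERATORS = {">": operator.gt, "<": operator.lt,
             "=": operator.eq, "!=": operator.ne}


def policy_fires(policy: dict, metrics: dict) -> bool:
    """Return True if any threshold is crossed (OR semantics, assumed)."""
    return any(
        t["metric"] in metrics
        and OPERATORS[t["operator"]](metrics[t["metric"]], t["value"])
        for t in policy["thresholds"]
    )


policy = {
    "thresholds": [
        {"metric": "accuracy", "operator": "<", "value": 90},
        {"metric": "latency", "operator": ">", "value": 300},
    ],
    "actions": [{"notify": "ml-team"}],
}

print(policy_fires(policy, {"accuracy": 95, "latency": 350}))  # True: latency too high
print(policy_fires(policy, {"accuracy": 85, "latency": 250}))  # True: accuracy too low
print(policy_fires(policy, {"accuracy": 95, "latency": 250}))  # False: both within limits
```

Replacing `any` with `all` would give AND semantics, which is why it matters to know which one a given tool applies.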