0
0
HLDsystem_design~7 mins

Alerting thresholds in HLD - System Design Guide

Choose your learning style9 modes available
Problem Statement
Without properly set alerting thresholds, systems either flood engineers with too many false alarms or miss critical failures. This leads to alert fatigue or delayed responses, causing downtime or degraded user experience.
Solution
Alerting thresholds define specific limits on system metrics that trigger alerts only when meaningful anomalies occur. By tuning these limits based on normal behavior and business impact, the system balances sensitivity and noise, ensuring timely and actionable notifications.
Architecture
Metrics
Collection
Threshold
Alert Rules
Alert Rules

This diagram shows how collected metrics are evaluated against configured alerting thresholds to decide when to trigger alerts.

Trade-offs
✓ Pros
Reduces false positives by filtering out normal fluctuations.
Ensures critical issues are detected promptly.
Helps prioritize alerts based on severity and impact.
Enables customization for different system components and business needs.
✗ Cons
Requires careful tuning and ongoing adjustment as system behavior changes.
Improper thresholds can cause missed alerts or alert storms.
Complex systems may need multiple thresholds, increasing configuration overhead.
Use when monitoring systems with variable workloads and critical uptime requirements, especially at scale above thousands of requests per second or complex distributed architectures.
Avoid when system metrics are stable and predictable with very low variability, or when alerting is handled manually or by simple fixed rules without automation.
Real World Examples
Netflix
Netflix uses dynamic alerting thresholds to detect streaming quality degradation without triggering alerts for normal traffic spikes.
Uber
Uber applies adaptive alert thresholds to monitor ride request latency, adjusting alerts based on time of day and regional traffic patterns.
Google
Google’s Site Reliability Engineering teams use alerting thresholds tuned per service to balance noise and signal in their massive infrastructure.
Alternatives
Anomaly Detection Alerts
Uses machine learning models to detect unusual patterns instead of fixed thresholds.
Use when: Choose when system behavior is complex and non-linear, making static thresholds ineffective.
Heartbeat Monitoring
Alerts based on missing periodic signals rather than metric thresholds.
Use when: Choose when monitoring service availability or liveness rather than performance metrics.
Summary
Alerting thresholds prevent overload of false alarms and missed critical issues by defining when alerts trigger.
They require careful tuning and ongoing adjustment to balance sensitivity and noise.
Proper thresholds improve incident response and system reliability at scale.