Design: Alerting Thresholds System
Scope: this design covers threshold configuration, alert detection, notification delivery, and the dashboard. It does not cover metric collection or storage in detail.
Functional Requirements
FR1: Allow users to define alert thresholds for various metrics (e.g., CPU usage, memory, error rates).
FR2: Support multiple types of thresholds: static value, percentage change, and rate of change.
FR3: Trigger alerts when thresholds are crossed and notify users via email or SMS.
FR4: Allow users to set different thresholds for different time periods (e.g., business hours vs off-hours).
FR5: Support grouping of alerts by service or application.
FR6: Provide a dashboard to view current alerts and threshold configurations.
FR7: Ensure alerts are generated with low latency (p99 < 1 second after threshold breach).
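
The threshold types in FR2 and the time-period scoping in FR4 could be modeled roughly as follows. This is a minimal sketch, not the system's actual data model: the names Threshold, Sample, ThresholdKind, and is_breached are hypothetical, and it assumes thresholds are upper bounds, that percentage change and rate of change are measured against the immediately preceding sample, and that changes in either direction count.

```python
from dataclasses import dataclass
from datetime import time
from enum import Enum


class ThresholdKind(Enum):
    STATIC = "static"                  # latest value vs. a fixed limit
    PERCENT_CHANGE = "percent_change"  # relative change vs. the previous sample, in percent
    RATE_OF_CHANGE = "rate_of_change"  # change per second vs. the previous sample


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since epoch
    value: float


@dataclass(frozen=True)
class Threshold:
    kind: ThresholdKind
    limit: float
    # Hypothetical daily window in which this threshold applies (FR4).
    active_from: time = time(0, 0)
    active_until: time = time(23, 59, 59)

    def is_active(self, now: time) -> bool:
        if self.active_from <= self.active_until:
            return self.active_from <= now <= self.active_until
        # The window wraps past midnight, e.g. 17:00 to 09:00 for off-hours.
        return now >= self.active_from or now <= self.active_until


def is_breached(threshold: Threshold, previous: Sample, current: Sample) -> bool:
    """Return True if `current` crosses `threshold`, given the prior sample."""
    if threshold.kind is ThresholdKind.STATIC:
        return current.value > threshold.limit

    delta = current.value - previous.value
    if threshold.kind is ThresholdKind.PERCENT_CHANGE:
        if previous.value == 0:
            # Percentage change is undefined; assumption: treat as no breach.
            return False
        return abs(delta) / abs(previous.value) * 100.0 > threshold.limit

    # RATE_OF_CHANGE: absolute change per second between the two samples.
    elapsed = current.timestamp - previous.timestamp
    if elapsed <= 0:
        return False  # out-of-order or duplicate sample; assumption: ignore
    return abs(delta) / elapsed > threshold.limit
```

Under these assumptions, a business-hours threshold would be declared as Threshold(ThresholdKind.STATIC, 90.0, time(9, 0), time(17, 0)), and an evaluator would call is_active before is_breached so that only the threshold for the current time period is checked.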
Non-Functional Requirements
NFR1: System must handle monitoring data from up to 10,000 services, each sending metrics once per minute.
NFR2: Availability target of 99.9% uptime (at most about 8.77 hours of downtime per year).
NFR3: System should scale horizontally to handle increased load.
NFR4: Alert notifications must be reliable with retry mechanisms.
NFR5: Threshold configurations and alert history should be retained for at least 90 days.
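
One common way to meet the retry requirement in NFR4 is exponential backoff around the notification call. The sketch below is illustrative only: send_with_retry and its parameters are hypothetical names, the attempt count and base delay are placeholder values, and the design does not specify which backoff policy the system uses. The sleep function is injectable so the policy can be exercised without real waiting.

```python
import time
from typing import Callable


def send_with_retry(
    send: Callable[[], None],
    max_attempts: int = 4,         # placeholder value, not from the design
    base_delay_s: float = 0.5,     # placeholder value, not from the design
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds and return the number of attempts used.

    The delay doubles after each failure (0.5 s, 1 s, 2 s, ...). If every
    attempt fails, the exception from the last attempt is re-raised so the
    caller can record the notification as undelivered.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

In practice the send callable would wrap the email or SMS provider call from FR3. A production version would likely add jitter to the delays and persist pending notifications so that retries survive a process restart.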