
Alerting thresholds in HLD - System Design Exercise

Design: Alerting Thresholds System
This design covers threshold configuration, alert detection, notification delivery, and the dashboard. It does not cover metric collection or storage in detail.
Functional Requirements
FR1: Allow users to define alert thresholds for various metrics (e.g., CPU usage, memory, error rates).
FR2: Support multiple types of thresholds: static values, percentage changes, and rate of change.
FR3: Trigger alerts when thresholds are crossed and notify users via email or SMS.
FR4: Allow users to set different thresholds for different time periods (e.g., business hours vs off-hours).
FR5: Support grouping of alerts by service or application.
FR6: Provide a dashboard to view current alerts and threshold configurations.
FR7: Ensure alerts are generated with low latency (p99 < 1 second after threshold breach).
Non-Functional Requirements
NFR1: System must handle monitoring data from up to 10,000 services sending metrics every minute.
NFR2: Availability target of 99.9% uptime (less than 8.77 hours downtime per year).
NFR3: System should scale horizontally to handle increased load.
NFR4: Alert notifications must be reliable with retry mechanisms.
NFR5: Data retention for threshold configurations and alert history should be at least 90 days.
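A quick back-of-envelope check of NFR1 helps size the ingestion pipeline. This is a sketch; the number of metrics carried per report is an assumption, not part of the requirements:

```python
# Rough capacity estimate for NFR1: 10,000 services reporting once per minute.
# Assumption: each report carries ~10 metrics (CPU, memory, error rate, ...).
services = 10_000
reports_per_minute = 1
metrics_per_report = 10          # assumed, not stated in the requirements

metric_points_per_second = services * reports_per_minute * metrics_per_report / 60
print(f"~{metric_points_per_second:.0f} metric points/sec")

# Raw metric volume over the 90-day retention window is substantial:
points_per_90_days = metric_points_per_second * 86_400 * 90
print(f"~{points_per_90_days / 1e9:.1f} billion points over 90 days")
```

Under these assumptions the steady-state load (~1,700 points/sec) is modest for a Kafka-based pipeline, but 90 days of raw points argues for keeping only alert history, not raw metrics, in the relational store.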
Think Before You Design
Questions to Ask
❓ What metrics need alerting, and at what granularity (per service, per host, per endpoint)?
❓ Which threshold types must be supported: static values, percentage changes, rate of change?
❓ Which notification channels are required, and is de-duplication or escalation expected?
❓ How quickly must an alert fire after a breach (the target here is p99 < 1 second)?
❓ What is the expected metric volume (here, up to 10,000 services reporting every minute)?
❓ How long must configurations and alert history be retained (at least 90 days here)?
Key Components
Threshold Configuration Service
Metrics Ingestion and Processing Pipeline
Alert Evaluation Engine
Notification Service
User Dashboard and API
Database for configurations and alert history
Design Patterns
Event-driven architecture for alert evaluation
Publish-subscribe for notification delivery
Caching for threshold configurations
Rate limiting and backoff for notifications
Time-series data handling for metric trends
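The "caching for threshold configurations" pattern can be sketched as a small read-through cache with a TTL, so the Alert Engine avoids a configuration-database round trip on every metric. The loader name below is an illustrative stand-in for a real DB query:

```python
import time

class ThresholdCache:
    """Read-through cache with a TTL for per-service threshold configs."""

    def __init__(self, loader, ttl_seconds=30.0):
        self._loader = loader          # e.g. a DB query; injected for testing
        self._ttl = ttl_seconds
        self._entries = {}             # service -> (expires_at, thresholds)

    def get(self, service):
        now = time.monotonic()
        entry = self._entries.get(service)
        if entry and entry[0] > now:
            return entry[1]                      # fresh: serve from cache
        thresholds = self._loader(service)       # stale or missing: reload
        self._entries[service] = (now + self._ttl, thresholds)
        return thresholds

# Usage with a hypothetical stand-in loader:
calls = []
def load_thresholds_from_db(service):            # hypothetical DB call
    calls.append(service)
    return [{"metric": "cpu", "type": "static", "value": 90.0}]

cache = ThresholdCache(load_thresholds_from_db, ttl_seconds=30.0)
cache.get("checkout")
cache.get("checkout")                            # second read served from cache
print(len(calls))                                # 1
```

The TTL bounds staleness: a threshold change takes effect within one TTL window without any invalidation protocol.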
Reference Architecture
                    +----------------------+
                    |  User Dashboard/API  |
                    +-----------+----------+
                                |
                                v
                +---------------+--------------+
                |   Threshold Configuration    |
                |           Service            |
                +---------------+--------------+
                                |
                                v
+----------------+     +----------------+     +----------------+
| Metrics Source | --> | Metrics Ingest | --> |  Alert Engine  |
+----------------+     +----------------+     +--------+-------+
                                                       |
                                                       v
                                              +--------+---------+
                                              |   Notification   |
                                              |     Service      |
                                              +------------------+
Components
Threshold Configuration Service
REST API with database (PostgreSQL)
Manage user-defined alert thresholds and store configurations.
Metrics Ingestion and Processing Pipeline
Kafka + Stream Processing (Apache Flink or Spark Streaming)
Receive and process incoming metrics in real-time.
Alert Evaluation Engine
Stream processing with rules engine
Evaluate metrics against thresholds and generate alerts.
Notification Service
Microservice with email/SMS gateways (e.g., Twilio, SMTP)
Send alert notifications reliably with retries.
User Dashboard and API
Web frontend (React) + backend API
Allow users to view and manage thresholds and alerts.
Database
PostgreSQL for configurations and alert history
Store threshold settings and alert records with retention.
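The core check inside the Alert Evaluation Engine can be sketched for the three threshold types from FR2 (static value, percentage change, rate of change). This is a minimal in-memory version under a one-minute-interval assumption; the real engine would run inside the stream processor:

```python
def breaches(threshold, current, previous=None, interval_seconds=60.0):
    """Return True if `current` breaches `threshold`.

    threshold: dict with "type" and "value". `previous` is the prior
    sample, required by the two change-based types.
    """
    kind, limit = threshold["type"], threshold["value"]
    if kind == "static":
        return current > limit
    if previous is None:
        return False                     # change-based types need history
    if kind == "pct_change":
        if previous == 0:
            return False                 # avoid division by zero
        return abs(current - previous) / abs(previous) * 100 > limit
    if kind == "rate_of_change":
        return abs(current - previous) / interval_seconds > limit
    raise ValueError(f"unknown threshold type: {kind}")

print(breaches({"type": "static", "value": 90.0}, 95.0))               # True
print(breaches({"type": "pct_change", "value": 50.0}, 30.0, 10.0))     # True (+200%)
print(breaches({"type": "rate_of_change", "value": 0.1}, 70.0, 10.0))  # True (1.0/s)
```

Keeping the check a pure function makes it easy to unit-test and to shard: each evaluation node only needs the thresholds and last sample for its assigned services.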
Request Flow
1. User sets or updates alert thresholds via the Dashboard/API.
2. Threshold Configuration Service validates and stores these settings in the database.
3. Metrics from services are sent continuously to the Metrics Ingestion Pipeline.
4. The pipeline processes metrics and forwards them to the Alert Evaluation Engine.
5. The Alert Evaluation Engine compares metrics against stored thresholds.
6. When a threshold is breached, an alert event is generated.
7. Alert events are sent to the Notification Service.
8. Notification Service sends alerts to users via email or SMS, retrying on failure.
9. Alerts and their statuses are stored in the database for dashboard display.
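Step 8's "retrying on failure" can be sketched as exponential backoff with a retry cap. The `send` callable below is a stand-in for a real gateway such as SMTP or Twilio:

```python
import time

def send_with_retries(send, alert, max_attempts=4, base_delay=0.5):
    """Attempt delivery with exponential backoff; return True on success.

    `send` is any callable that raises on failure (a gateway stand-in).
    Delays grow as base_delay * 2**attempt: 0.5s, 1s, 2s between attempts.
    """
    for attempt in range(max_attempts):
        try:
            send(alert)
            return True
        except Exception:
            if attempt == max_attempts - 1:
                return False     # exhausted: hand off to a dead-letter queue
            time.sleep(base_delay * 2 ** attempt)
    return False

# Usage with a flaky stand-in gateway that fails twice, then succeeds:
attempts = []
def flaky_gateway(alert):
    attempts.append(alert)
    if len(attempts) < 3:
        raise ConnectionError("gateway unavailable")

ok = send_with_retries(flaky_gateway, {"id": 1}, base_delay=0.01)
print(ok, len(attempts))   # True 3
```

In production the final failure path would publish to a dead-letter queue rather than silently return False, so no alert is dropped (NFR4).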
Database Schema
Entities:
- User (id, name, contact_info)
- Threshold (id, user_id, metric_name, threshold_type, threshold_value, time_period, service_group)
- Alert (id, threshold_id, metric_value, timestamp, status)

Relationships:
- User 1:N Threshold
- Threshold 1:N Alert
- Alerts are linked to a specific metric and service group for filtering.
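The schema above can be sketched as DDL. sqlite3 is used here only to keep the sketch runnable; the design calls for PostgreSQL, and the column types are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    contact_info TEXT NOT NULL            -- email and/or SMS number
);

CREATE TABLE threshold (
    id              INTEGER PRIMARY KEY,
    user_id         INTEGER NOT NULL REFERENCES user(id),
    metric_name     TEXT NOT NULL,        -- e.g. 'cpu_usage'
    threshold_type  TEXT NOT NULL,        -- 'static' | 'pct_change' | 'rate_of_change'
    threshold_value REAL NOT NULL,
    time_period     TEXT,                 -- e.g. 'business_hours' (FR4)
    service_group   TEXT                  -- grouping for FR5
);

CREATE TABLE alert (
    id           INTEGER PRIMARY KEY,
    threshold_id INTEGER NOT NULL REFERENCES threshold(id),
    metric_value REAL NOT NULL,
    timestamp    TEXT NOT NULL,           -- breach time, UTC
    status       TEXT NOT NULL            -- e.g. 'firing' | 'resolved'
);

-- Alert history is queried by threshold and time for the dashboard,
-- and pruned past the 90-day retention window.
CREATE INDEX idx_alert_threshold_time ON alert (threshold_id, timestamp);
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)   # ['alert', 'threshold', 'user']
```

An index on (threshold_id, timestamp) supports both dashboard queries and retention pruning without full scans.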
Scaling Discussion
Bottlenecks
High volume of incoming metrics causing ingestion delays.
Alert Evaluation Engine becoming CPU-bound with many thresholds.
Notification Service overwhelmed by large alert bursts.
Database write/read bottlenecks for storing configurations and alerts.
Solutions
Partition metrics ingestion by service groups and scale Kafka clusters horizontally.
Distribute alert evaluation across multiple nodes with sharding by metric or service.
Implement notification rate limiting, batching, and use scalable third-party gateways.
Use database sharding, read replicas, and caching for threshold configurations.
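The partitioning in the first two solutions can be sketched as stable hash partitioning, Kafka-key style, so all metrics for one service group land on the same ingestion partition and evaluation node:

```python
import hashlib

def partition_for(service_group: str, num_partitions: int) -> int:
    """Map a service group to a stable partition.

    A stable digest is used instead of Python's builtin hash(), which is
    salted per process and would break routing across producers/restarts.
    """
    digest = hashlib.sha256(service_group.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# All metrics for one group always route to the same partition:
p1 = partition_for("payments", 12)
p2 = partition_for("payments", 12)
print(p1 == p2, 0 <= p1 < 12)   # True True
```

Keying by service group also preserves per-group ordering, which the change-based threshold types rely on. The trade-off is hot partitions if one group dominates traffic.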
Interview Tips
Time: Spend 10 minutes clarifying requirements and constraints, 20 minutes designing components and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Clarify types of thresholds and alerting needs early.
Explain how real-time processing enables low latency alerts.
Discuss reliability in notification delivery with retries.
Highlight database design supporting retention and querying.
Address scaling challenges and horizontal scaling strategies.
Mention user experience via dashboard and API.