
Alerting thresholds in HLD - System Design Exercise

Design: Alerting Thresholds System
This design covers threshold configuration, alert detection, notification delivery, and the dashboard. It does not cover metric collection or storage in detail.
Functional Requirements
FR1: Allow users to define alert thresholds for various metrics (e.g., CPU usage, memory, error rates).
FR2: Support multiple types of thresholds: static values, percentage changes, and rate of change.
FR3: Trigger alerts when thresholds are crossed and notify users via email or SMS.
FR4: Allow users to set different thresholds for different time periods (e.g., business hours vs off-hours).
FR5: Support grouping of alerts by service or application.
FR6: Provide a dashboard to view current alerts and threshold configurations.
FR7: Ensure alerts are generated with low latency (p99 < 1 second after threshold breach).
Non-Functional Requirements
NFR1: System must handle monitoring data from up to 10,000 services sending metrics every minute.
NFR2: Availability target of 99.9% uptime (less than 8.77 hours downtime per year).
NFR3: System should scale horizontally to handle increased load.
NFR4: Alert notifications must be reliable with retry mechanisms.
NFR5: Data retention for threshold configurations and alert history should be at least 90 days.
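A quick back-of-envelope check of NFR1 helps size the ingestion pipeline. This is a sketch; the number of metrics carried per report is an assumption, not part of the requirements:

```python
# Rough capacity estimate for NFR1: 10,000 services reporting once per minute.
# Assumption: each report carries ~10 metrics (CPU, memory, error rate, ...).
services = 10_000
reports_per_minute = 1
metrics_per_report = 10          # assumed, not stated in the requirements

metric_points_per_second = services * reports_per_minute * metrics_per_report / 60
print(f"~{metric_points_per_second:.0f} metric points/sec")

# Raw metric volume over the 90-day retention window is substantial:
points_per_90_days = metric_points_per_second * 86_400 * 90
print(f"~{points_per_90_days / 1e9:.1f} billion points over 90 days")
```

Under these assumptions the steady-state load (~1,700 points/sec) is modest for a Kafka-based pipeline, but 90 days of raw points argues for keeping only alert history, not raw metrics, in the relational store.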
Think Before You Design
Questions to Ask
❓ What metrics need alerting, and at what granularity (per service, per host, per endpoint)?
❓ Which threshold types must be supported: static values, percentage changes, rate of change?
❓ Which notification channels are required, and is de-duplication or escalation expected?
❓ How quickly must an alert fire after a breach (the target here is p99 < 1 second)?
❓ What is the expected metric volume (here, up to 10,000 services reporting every minute)?
❓ How long must configurations and alert history be retained (at least 90 days here)?
Key Components
Threshold Configuration Service
Metrics Ingestion and Processing Pipeline
Alert Evaluation Engine
Notification Service
User Dashboard and API
Database for configurations and alert history
Design Patterns
Event-driven architecture for alert evaluation
Publish-subscribe for notification delivery
Caching for threshold configurations
Rate limiting and backoff for notifications
Time-series data handling for metric trends
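The "caching for threshold configurations" pattern can be sketched as a small read-through cache with a TTL, so the Alert Engine avoids a configuration-database round trip on every metric. The loader name below is an illustrative stand-in for a real DB query:

```python
import time

class ThresholdCache:
    """Read-through cache with a TTL for per-service threshold configs."""

    def __init__(self, loader, ttl_seconds=30.0):
        self._loader = loader          # e.g. a DB query; injected for testing
        self._ttl = ttl_seconds
        self._entries = {}             # service -> (expires_at, thresholds)

    def get(self, service):
        now = time.monotonic()
        entry = self._entries.get(service)
        if entry and entry[0] > now:
            return entry[1]                      # fresh: serve from cache
        thresholds = self._loader(service)       # stale or missing: reload
        self._entries[service] = (now + self._ttl, thresholds)
        return thresholds

# Usage with a hypothetical stand-in loader:
calls = []
def load_thresholds_from_db(service):            # hypothetical DB call
    calls.append(service)
    return [{"metric": "cpu", "type": "static", "value": 90.0}]

cache = ThresholdCache(load_thresholds_from_db, ttl_seconds=30.0)
cache.get("checkout")
cache.get("checkout")                            # second read served from cache
print(len(calls))                                # 1
```

The TTL bounds staleness: a threshold change takes effect within one TTL window without any invalidation protocol.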
Reference Architecture
                    +----------------------+
                    |  User Dashboard/API  |
                    +-----------+----------+
                                |
                                v
                +---------------+--------------+
                |   Threshold Configuration    |
                |           Service            |
                +---------------+--------------+
                                |
                                v
+----------------+     +----------------+     +----------------+
| Metrics Source | --> | Metrics Ingest | --> |  Alert Engine  |
+----------------+     +----------------+     +--------+-------+
                                                       |
                                                       v
                                              +--------+---------+
                                              |   Notification   |
                                              |     Service      |
                                              +------------------+
Components
Threshold Configuration Service
REST API with database (PostgreSQL)
Manage user-defined alert thresholds and store configurations.
Metrics Ingestion and Processing Pipeline
Kafka + Stream Processing (Apache Flink or Spark Streaming)
Receive and process incoming metrics in real-time.
Alert Evaluation Engine
Stream processing with rules engine
Evaluate metrics against thresholds and generate alerts.
Notification Service
Microservice with email/SMS gateways (e.g., Twilio, SMTP)
Send alert notifications reliably with retries.
User Dashboard and API
Web frontend (React) + backend API
Allow users to view and manage thresholds and alerts.
Database
PostgreSQL for configurations and alert history
Store threshold settings and alert records with retention.
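The core check inside the Alert Evaluation Engine can be sketched for the three threshold types from FR2 (static value, percentage change, rate of change). This is a minimal in-memory version under a one-minute-interval assumption; the real engine would run inside the stream processor:

```python
def breaches(threshold, current, previous=None, interval_seconds=60.0):
    """Return True if `current` breaches `threshold`.

    threshold: dict with "type" and "value". `previous` is the prior
    sample, required by the two change-based types.
    """
    kind, limit = threshold["type"], threshold["value"]
    if kind == "static":
        return current > limit
    if previous is None:
        return False                     # change-based types need history
    if kind == "pct_change":
        if previous == 0:
            return False                 # avoid division by zero
        return abs(current - previous) / abs(previous) * 100 > limit
    if kind == "rate_of_change":
        return abs(current - previous) / interval_seconds > limit
    raise ValueError(f"unknown threshold type: {kind}")

print(breaches({"type": "static", "value": 90.0}, 95.0))               # True
print(breaches({"type": "pct_change", "value": 50.0}, 30.0, 10.0))     # True (+200%)
print(breaches({"type": "rate_of_change", "value": 0.1}, 70.0, 10.0))  # True (1.0/s)
```

Keeping the check a pure function makes it easy to unit-test and to shard: each evaluation node only needs the thresholds and last sample for its assigned services.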
Request Flow
1. User sets or updates alert thresholds via the Dashboard/API.
2. Threshold Configuration Service validates and stores these settings in the database.
3. Metrics from services are sent continuously to the Metrics Ingestion Pipeline.
4. The pipeline processes metrics and forwards them to the Alert Evaluation Engine.
5. The Alert Evaluation Engine compares metrics against stored thresholds.
6. When a threshold is breached, an alert event is generated.
7. Alert events are sent to the Notification Service.
8. Notification Service sends alerts to users via email or SMS, retrying on failure.
9. Alerts and their statuses are stored in the database for dashboard display.
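Step 8's "retrying on failure" can be sketched as exponential backoff with a retry cap. The `send` callable below is a stand-in for a real gateway such as SMTP or Twilio:

```python
import time

def send_with_retries(send, alert, max_attempts=4, base_delay=0.5):
    """Attempt delivery with exponential backoff; return True on success.

    `send` is any callable that raises on failure (a gateway stand-in).
    Delays grow as base_delay * 2**attempt: 0.5s, 1s, 2s between attempts.
    """
    for attempt in range(max_attempts):
        try:
            send(alert)
            return True
        except Exception:
            if attempt == max_attempts - 1:
                return False     # exhausted: hand off to a dead-letter queue
            time.sleep(base_delay * 2 ** attempt)
    return False

# Usage with a flaky stand-in gateway that fails twice, then succeeds:
attempts = []
def flaky_gateway(alert):
    attempts.append(alert)
    if len(attempts) < 3:
        raise ConnectionError("gateway unavailable")

ok = send_with_retries(flaky_gateway, {"id": 1}, base_delay=0.01)
print(ok, len(attempts))   # True 3
```

In production the final failure path would publish to a dead-letter queue rather than silently return False, so no alert is dropped (NFR4).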
Database Schema
Entities:
- User (id, name, contact_info)
- Threshold (id, user_id, metric_name, threshold_type, threshold_value, time_period, service_group)
- Alert (id, threshold_id, metric_value, timestamp, status)

Relationships:
- User 1:N Threshold
- Threshold 1:N Alert
- Alerts are linked to a specific metric and service group for filtering.
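The schema above can be sketched as DDL. sqlite3 is used here only to keep the sketch runnable; the design calls for PostgreSQL, and the column types are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    contact_info TEXT NOT NULL            -- email and/or SMS number
);

CREATE TABLE threshold (
    id              INTEGER PRIMARY KEY,
    user_id         INTEGER NOT NULL REFERENCES user(id),
    metric_name     TEXT NOT NULL,        -- e.g. 'cpu_usage'
    threshold_type  TEXT NOT NULL,        -- 'static' | 'pct_change' | 'rate_of_change'
    threshold_value REAL NOT NULL,
    time_period     TEXT,                 -- e.g. 'business_hours' (FR4)
    service_group   TEXT                  -- grouping for FR5
);

CREATE TABLE alert (
    id           INTEGER PRIMARY KEY,
    threshold_id INTEGER NOT NULL REFERENCES threshold(id),
    metric_value REAL NOT NULL,
    timestamp    TEXT NOT NULL,           -- breach time, UTC
    status       TEXT NOT NULL            -- e.g. 'firing' | 'resolved'
);

-- Alert history is queried by threshold and time for the dashboard,
-- and pruned past the 90-day retention window.
CREATE INDEX idx_alert_threshold_time ON alert (threshold_id, timestamp);
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)   # ['alert', 'threshold', 'user']
```

An index on (threshold_id, timestamp) supports both dashboard queries and retention pruning without full scans.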
Scaling Discussion
Bottlenecks
High volume of incoming metrics causing ingestion delays.
Alert Evaluation Engine becoming CPU-bound with many thresholds.
Notification Service overwhelmed by large alert bursts.
Database write/read bottlenecks for storing configurations and alerts.
Solutions
Partition metrics ingestion by service groups and scale Kafka clusters horizontally.
Distribute alert evaluation across multiple nodes with sharding by metric or service.
Implement notification rate limiting, batching, and use scalable third-party gateways.
Use database sharding, read replicas, and caching for threshold configurations.
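The partitioning in the first two solutions can be sketched as stable hash partitioning, Kafka-key style, so all metrics for one service group land on the same ingestion partition and evaluation node:

```python
import hashlib

def partition_for(service_group: str, num_partitions: int) -> int:
    """Map a service group to a stable partition.

    A stable digest is used instead of Python's builtin hash(), which is
    salted per process and would break routing across producers/restarts.
    """
    digest = hashlib.sha256(service_group.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# All metrics for one group always route to the same partition:
p1 = partition_for("payments", 12)
p2 = partition_for("payments", 12)
print(p1 == p2, 0 <= p1 < 12)   # True True
```

Keying by service group also preserves per-group ordering, which the change-based threshold types rely on. The trade-off is hot partitions if one group dominates traffic.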
Interview Tips
Time: Spend 10 minutes clarifying requirements and constraints, 20 minutes designing components and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Clarify types of thresholds and alerting needs early.
Explain how real-time processing enables low latency alerts.
Discuss reliability in notification delivery with retries.
Highlight database design supporting retention and querying.
Address scaling challenges and horizontal scaling strategies.
Mention user experience via dashboard and API.