
Throughput, latency, and availability in HLD - System Design Exercise

Design: Performance Metrics Analysis System
The design focuses on capturing, processing, and reporting throughput, latency, and availability metrics. Detailed application logic and business features are out of scope.
Functional Requirements
FR1: Measure and report system throughput in requests per second
FR2: Track latency for each request with p50, p95, and p99 percentiles
FR3: Monitor system availability with uptime percentage and downtime alerts
FR4: Provide real-time dashboards for throughput, latency, and availability
FR5: Support alerting when latency or availability thresholds are breached
Non-Functional Requirements
NFR1: Handle up to 100,000 requests per second
NFR2: Latency measurement accuracy within 1 millisecond
NFR3: Availability target of 99.9% uptime (less than 8.77 hours downtime per year)
NFR4: Dashboard updates with less than 5 seconds delay
NFR5: System must be fault tolerant and highly available
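NFR3's downtime figure follows directly from the availability target: 99.9% uptime leaves 0.1% of the year as an allowable downtime "error budget". A quick sketch of that arithmetic:

```python
# Converting an availability target into an annual downtime budget.
# 99.9% uptime means 0.1% of the year may be spent down (NFR3).

HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, accounting for leap years

def downtime_budget_hours(availability: float) -> float:
    """Allowed downtime per year, in hours, for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability)

print(round(downtime_budget_hours(0.999), 2))   # ~8.77 hours/year (NFR3)
print(round(downtime_budget_hours(0.9999), 2))  # ~0.88 hours/year ("four nines")
```

Each added "nine" shrinks the budget by a factor of ten, which is why availability targets are usually negotiated rather than maximized.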
Key Components
Metrics collection agents or instrumentation
Data ingestion pipeline for metrics
Time-series database for storing metrics
Real-time analytics engine for percentile calculations
Dashboard and alerting system
Design Patterns
Event-driven architecture for metrics ingestion
Sliding window or histogram for latency percentiles
Circuit breaker pattern for availability monitoring
Caching for dashboard performance
Redundancy and failover for high availability
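The histogram pattern above can be sketched concretely: rather than storing every latency sample, count observations into fixed buckets and read percentiles off the cumulative counts. The bucket boundaries below are illustrative assumptions; real systems choose them to match expected latency ranges.

```python
import bisect

# Illustrative bucket upper bounds, in milliseconds.
BUCKETS_MS = [1, 2, 5, 10, 25, 50, 100, 250, 500, 1000]

class LatencyHistogram:
    def __init__(self):
        # One counter per bucket, plus an overflow bucket at the end.
        self.counts = [0] * (len(BUCKETS_MS) + 1)
        self.total = 0

    def observe(self, latency_ms: float) -> None:
        # First bucket whose upper bound covers this latency.
        idx = bisect.bisect_left(BUCKETS_MS, latency_ms)
        self.counts[idx] += 1
        self.total += 1

    def percentile(self, p: float) -> float:
        """Upper bound of the bucket containing the p-th percentile."""
        target = p * self.total
        cumulative = 0
        for idx, count in enumerate(self.counts):
            cumulative += count
            if cumulative >= target:
                return BUCKETS_MS[idx] if idx < len(BUCKETS_MS) else float("inf")
        return float("inf")

h = LatencyHistogram()
for ms in [3, 4, 6, 8, 40, 90, 300]:
    h.observe(ms)
print(h.percentile(0.50), h.percentile(0.95))  # 10 500
```

The trade-off is bounded memory and O(1) ingestion in exchange for percentile error capped by bucket width, which satisfies NFR2 as long as buckets near the typical latency are at most 1 ms wide.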
Reference Architecture
  +----------------+      +-------------------+      +----------------------+
  | Metrics Source | ---> | Metrics Collector | ---> | Time-Series Database |
  +----------------+      +-------------------+      +----------------------+
                                    |                           |
                                    v                           v
                          +------------------+         +-----------------+
                          | Analytics Engine |         | Alerting System |
                          +------------------+         +-----------------+
                                    |
                                    v
                          +------------------+
                          |    Dashboard     |
                          +------------------+
Components
Metrics Source
Application instrumentation, SDKs
Generate throughput, latency, and availability data from user requests
Metrics Collector
Prometheus, Fluentd, or custom agents
Aggregate and forward metrics data to storage
Time-Series Database
InfluxDB, Prometheus TSDB, or TimescaleDB
Store time-stamped metrics efficiently for querying
Analytics Engine
Apache Flink, Spark Streaming, or custom service
Calculate latency percentiles and throughput in real-time
Alerting System
PagerDuty, Grafana Alerting, or custom alerts
Notify operators when latency or availability thresholds are violated
Dashboard
Grafana, Kibana, or custom web UI
Visualize throughput, latency, and availability metrics in real-time
Request Flow
1. User requests generate metrics data (timestamps, response times, success/failure).
2. Metrics Collector receives data from sources and batches it for efficiency.
3. Data is stored in the Time-Series Database with timestamps for historical analysis.
4. Analytics Engine queries the database continuously to compute throughput and latency percentiles.
5. Alerting System monitors analytics results and triggers alerts on SLA breaches.
6. Dashboard queries analytics and database to display current and historical metrics to users.
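Step 4's throughput calculation can be done with a sliding window over request timestamps. A minimal sketch, assuming a 1-second window (the window length is a tunable choice, not specified above):

```python
import time
from collections import deque
from typing import Optional

class ThroughputMeter:
    """Sliding-window requests-per-second counter."""

    def __init__(self, window_seconds: float = 1.0):
        self.window = window_seconds
        self.events: deque = deque()  # request timestamps, oldest first

    def record(self, now: Optional[float] = None) -> None:
        self.events.append(time.monotonic() if now is None else now)

    def rps(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        # Evict timestamps that have fallen out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) / self.window

m = ThroughputMeter(window_seconds=1.0)
for t in [0.1, 0.2, 0.3, 0.9, 1.05]:
    m.record(now=t)
print(m.rps(now=1.1))  # 4.0 -- the 0.1 event has aged out
```

At 100,000 RPS (NFR1) a per-timestamp deque is too expensive; production collectors keep per-second counters instead, but the windowing logic is the same.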
Database Schema
Entities:
- MetricRecord: {id, timestamp, metric_type (throughput/latency/availability), value, request_id}
- Alert: {id, metric_type, threshold, triggered_at, resolved_at, status}
Relationships:
- MetricRecord stores raw data points linked by timestamp.
- Alerts reference metric types and track alert lifecycle.
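The two entities can be sketched as typed records; the field names follow the schema above, while the concrete types are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class MetricType(Enum):
    THROUGHPUT = "throughput"
    LATENCY = "latency"
    AVAILABILITY = "availability"

@dataclass
class MetricRecord:
    id: int
    timestamp: float        # epoch seconds of the observation
    metric_type: MetricType
    value: float            # RPS, latency in ms, or 0/1 success flag
    request_id: str

@dataclass
class Alert:
    id: int
    metric_type: MetricType
    threshold: float
    triggered_at: float
    resolved_at: Optional[float]  # None while the alert is still firing
    status: str                   # e.g. "firing" / "resolved"

r = MetricRecord(1, 1700000000.0, MetricType.LATENCY, 12.5, "req-42")
print(r.metric_type.value)  # latency
```

In a real time-series database these records would be stored columnar and partitioned by time rather than as row objects, but the logical shape is the same.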
Scaling Discussion
Bottlenecks
Metrics Collector overwhelmed by high request volume
Time-Series Database storage and query latency under heavy load
Analytics Engine processing delays with large data streams
Alerting System flooding with false positives or missed alerts
Dashboard performance degradation with many concurrent users
Solutions
Use load balancing and sharding for Metrics Collector to distribute ingestion load
Partition Time-Series Database by time and metric type; use compression and downsampling
Scale Analytics Engine horizontally with stream partitioning and parallel processing
Implement adaptive alert thresholds and deduplication to reduce noise
Cache dashboard queries and use CDN for static assets to improve responsiveness
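The alert-deduplication solution above is often implemented as a per-alert cooldown: once an alert fires, identical alerts are suppressed until the cooldown expires. A minimal sketch (the 5-minute cooldown is an assumed default):

```python
from typing import Dict

class DedupAlerter:
    """Suppresses duplicate alerts for the same key within a cooldown window."""

    def __init__(self, cooldown_seconds: float = 300.0):
        self.cooldown = cooldown_seconds
        self.last_fired: Dict[str, float] = {}  # alert key -> last fire time

    def should_fire(self, key: str, now: float) -> bool:
        last = self.last_fired.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still in cooldown: suppress the duplicate
        self.last_fired[key] = now
        return True

a = DedupAlerter(cooldown_seconds=300)
print(a.should_fire("latency_p99", now=0))    # True  (first alert)
print(a.should_fire("latency_p99", now=60))   # False (within cooldown)
print(a.should_fire("latency_p99", now=400))  # True  (cooldown expired)
```

Pairing this with adaptive thresholds (e.g. alerting on deviation from a rolling baseline rather than a fixed value) addresses both the false-positive and flooding bottlenecks listed above.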
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying scale. Use 20 minutes to design components and data flow. Reserve 10 minutes to discuss scaling and trade-offs. Leave 5 minutes for questions.
Explain how throughput, latency, and availability differ and why each matters
Describe how metrics are collected without impacting system performance
Discuss real-time analytics techniques for latency percentiles
Highlight fault tolerance and high availability strategies
Address how to handle scale and maintain low latency in monitoring