
Throughput, latency, and availability in HLD - System Design Exercise

Design: Performance Metrics Analysis System
The design focuses on capturing, processing, and reporting throughput, latency, and availability metrics. Detailed application logic and business features are out of scope.
Functional Requirements
FR1: Measure and report system throughput in requests per second
FR2: Track latency for each request with p50, p95, and p99 percentiles
FR3: Monitor system availability with uptime percentage and downtime alerts
FR4: Provide real-time dashboards for throughput, latency, and availability
FR5: Support alerting when latency or availability thresholds are breached
Non-Functional Requirements
NFR1: Handle up to 100,000 requests per second
NFR2: Latency measurement accuracy within 1 millisecond
NFR3: Availability target of 99.9% uptime (less than 8.77 hours downtime per year)
NFR4: Dashboard updates with less than 5 seconds delay
NFR5: System must be fault tolerant and highly available
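NFR3's downtime figure follows directly from the availability target: 99.9% uptime leaves 0.1% of the year as an allowable downtime "error budget". A quick sketch of that arithmetic:

```python
# Converting an availability target into an annual downtime budget.
# 99.9% uptime means 0.1% of the year may be spent down (NFR3).

HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, accounting for leap years

def downtime_budget_hours(availability: float) -> float:
    """Allowed downtime per year, in hours, for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability)

print(round(downtime_budget_hours(0.999), 2))   # ~8.77 hours/year (NFR3)
print(round(downtime_budget_hours(0.9999), 2))  # ~0.88 hours/year ("four nines")
```

Each added "nine" shrinks the budget by a factor of ten, which is why availability targets are usually negotiated rather than maximized.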
Key Components
Metrics collection agents or instrumentation
Data ingestion pipeline for metrics
Time-series database for storing metrics
Real-time analytics engine for percentile calculations
Dashboard and alerting system
Design Patterns
Event-driven architecture for metrics ingestion
Sliding window or histogram for latency percentiles
Circuit breaker pattern for availability monitoring
Caching for dashboard performance
Redundancy and failover for high availability
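The histogram pattern above can be sketched concretely: rather than storing every latency sample, count observations into fixed buckets and read percentiles off the cumulative counts. The bucket boundaries below are illustrative assumptions; real systems choose them to match expected latency ranges.

```python
import bisect

# Illustrative bucket upper bounds, in milliseconds.
BUCKETS_MS = [1, 2, 5, 10, 25, 50, 100, 250, 500, 1000]

class LatencyHistogram:
    def __init__(self):
        # One counter per bucket, plus an overflow bucket at the end.
        self.counts = [0] * (len(BUCKETS_MS) + 1)
        self.total = 0

    def observe(self, latency_ms: float) -> None:
        # First bucket whose upper bound covers this latency.
        idx = bisect.bisect_left(BUCKETS_MS, latency_ms)
        self.counts[idx] += 1
        self.total += 1

    def percentile(self, p: float) -> float:
        """Upper bound of the bucket containing the p-th percentile."""
        target = p * self.total
        cumulative = 0
        for idx, count in enumerate(self.counts):
            cumulative += count
            if cumulative >= target:
                return BUCKETS_MS[idx] if idx < len(BUCKETS_MS) else float("inf")
        return float("inf")

h = LatencyHistogram()
for ms in [3, 4, 6, 8, 40, 90, 300]:
    h.observe(ms)
print(h.percentile(0.50), h.percentile(0.95))  # 10 500
```

The trade-off is bounded memory and O(1) ingestion in exchange for percentile error capped by bucket width, which satisfies NFR2 as long as buckets near the typical latency are at most 1 ms wide.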
Reference Architecture
  +----------------+      +-------------------+      +----------------------+
  | Metrics Source | ---> | Metrics Collector | ---> | Time-Series Database |
  +----------------+      +-------------------+      +----------------------+
                                    |                           |
                                    v                           v
                          +------------------+         +-----------------+
                          | Analytics Engine |         | Alerting System |
                          +------------------+         +-----------------+
                                    |
                                    v
                          +------------------+
                          |    Dashboard     |
                          +------------------+
Components
Metrics Source
Application instrumentation, SDKs
Generate throughput, latency, and availability data from user requests
Metrics Collector
Prometheus, Fluentd, or custom agents
Aggregate and forward metrics data to storage
Time-Series Database
InfluxDB, Prometheus TSDB, or TimescaleDB
Store time-stamped metrics efficiently for querying
Analytics Engine
Apache Flink, Spark Streaming, or custom service
Calculate latency percentiles and throughput in real-time
Alerting System
PagerDuty, Grafana Alerting, or custom alerts
Notify operators when latency or availability thresholds are violated
Dashboard
Grafana, Kibana, or custom web UI
Visualize throughput, latency, and availability metrics in real-time
Request Flow
1. User requests generate metrics data (timestamps, response times, success/failure).
2. Metrics Collector receives data from sources and batches it for efficiency.
3. Data is stored in the Time-Series Database with timestamps for historical analysis.
4. Analytics Engine queries the database continuously to compute throughput and latency percentiles.
5. Alerting System monitors analytics results and triggers alerts on SLA breaches.
6. Dashboard queries analytics and database to display current and historical metrics to users.
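Step 4's throughput calculation can be done with a sliding window over request timestamps. A minimal sketch, assuming a 1-second window (the window length is a tunable choice, not specified above):

```python
import time
from collections import deque
from typing import Optional

class ThroughputMeter:
    """Sliding-window requests-per-second counter."""

    def __init__(self, window_seconds: float = 1.0):
        self.window = window_seconds
        self.events: deque = deque()  # request timestamps, oldest first

    def record(self, now: Optional[float] = None) -> None:
        self.events.append(time.monotonic() if now is None else now)

    def rps(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        # Evict timestamps that have fallen out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) / self.window

m = ThroughputMeter(window_seconds=1.0)
for t in [0.1, 0.2, 0.3, 0.9, 1.05]:
    m.record(now=t)
print(m.rps(now=1.1))  # 4.0 -- the 0.1 event has aged out
```

At 100,000 RPS (NFR1) a per-timestamp deque is too expensive; production collectors keep per-second counters instead, but the windowing logic is the same.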
Database Schema
Entities:
- MetricRecord: {id, timestamp, metric_type (throughput/latency/availability), value, request_id}
- Alert: {id, metric_type, threshold, triggered_at, resolved_at, status}
Relationships:
- MetricRecord stores raw data points linked by timestamp.
- Alerts reference metric types and track alert lifecycle.
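The two entities can be sketched as typed records; the field names follow the schema above, while the concrete types are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class MetricType(Enum):
    THROUGHPUT = "throughput"
    LATENCY = "latency"
    AVAILABILITY = "availability"

@dataclass
class MetricRecord:
    id: int
    timestamp: float        # epoch seconds of the observation
    metric_type: MetricType
    value: float            # RPS, latency in ms, or 0/1 success flag
    request_id: str

@dataclass
class Alert:
    id: int
    metric_type: MetricType
    threshold: float
    triggered_at: float
    resolved_at: Optional[float]  # None while the alert is still firing
    status: str                   # e.g. "firing" / "resolved"

r = MetricRecord(1, 1700000000.0, MetricType.LATENCY, 12.5, "req-42")
print(r.metric_type.value)  # latency
```

In a real time-series database these records would be stored columnar and partitioned by time rather than as row objects, but the logical shape is the same.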
Scaling Discussion
Bottlenecks
Metrics Collector overwhelmed by high request volume
Time-Series Database storage and query latency under heavy load
Analytics Engine processing delays with large data streams
Alerting System flooding with false positives or missed alerts
Dashboard performance degradation with many concurrent users
Solutions
Use load balancing and sharding for Metrics Collector to distribute ingestion load
Partition Time-Series Database by time and metric type; use compression and downsampling
Scale Analytics Engine horizontally with stream partitioning and parallel processing
Implement adaptive alert thresholds and deduplication to reduce noise
Cache dashboard queries and use CDN for static assets to improve responsiveness
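The alert-deduplication solution above is often implemented as a per-alert cooldown: once an alert fires, identical alerts are suppressed until the cooldown expires. A minimal sketch (the 5-minute cooldown is an assumed default):

```python
from typing import Dict

class DedupAlerter:
    """Suppresses duplicate alerts for the same key within a cooldown window."""

    def __init__(self, cooldown_seconds: float = 300.0):
        self.cooldown = cooldown_seconds
        self.last_fired: Dict[str, float] = {}  # alert key -> last fire time

    def should_fire(self, key: str, now: float) -> bool:
        last = self.last_fired.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still in cooldown: suppress the duplicate
        self.last_fired[key] = now
        return True

a = DedupAlerter(cooldown_seconds=300)
print(a.should_fire("latency_p99", now=0))    # True  (first alert)
print(a.should_fire("latency_p99", now=60))   # False (within cooldown)
print(a.should_fire("latency_p99", now=400))  # True  (cooldown expired)
```

Pairing this with adaptive thresholds (e.g. alerting on deviation from a rolling baseline rather than a fixed value) addresses both the false-positive and flooding bottlenecks listed above.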
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying scale. Use 20 minutes to design components and data flow. Reserve 10 minutes to discuss scaling and trade-offs. Leave 5 minutes for questions.
Explain how throughput, latency, and availability differ and why each matters
Describe how metrics are collected without impacting system performance
Discuss real-time analytics techniques for latency percentiles
Highlight fault tolerance and high availability strategies
Address how to handle scale and maintain low latency in monitoring