
Metrics collection in HLD - System Design Exercise

Design: Metrics Collection System
This design covers metrics ingestion, storage, querying, and aggregation. Out of scope: a detailed alerting system and visualization dashboards.
Functional Requirements
FR1: Collect metrics data from multiple application instances in real-time
FR2: Support different types of metrics: counters, gauges, histograms
FR3: Allow querying aggregated metrics for monitoring and alerting
FR4: Store metrics data efficiently for at least 30 days
FR5: Provide APIs for metrics ingestion and querying
FR6: Ensure minimal impact on application performance during metrics collection
Non-Functional Requirements
NFR1: Handle up to 100,000 metrics data points per second
NFR2: API response latency for queries should be under 200ms (p99)
NFR3: System availability should be at least 99.9%
NFR4: Data retention for 30 days with efficient storage
NFR5: Support horizontal scaling for ingestion and querying
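A quick back-of-envelope check of NFR1 and NFR4 helps size the storage tier. The 16-bytes-per-point and 10x compression figures below are illustrative assumptions, not part of the requirements:

```python
# Back-of-envelope storage estimate for NFR1 + NFR4.
# Assumed (not from the requirements): ~16 bytes per raw point
# (8-byte timestamp + 8-byte float value, labels amortized out)
# and ~10x compression, typical for time-series encodings.
POINTS_PER_SEC = 100_000
RETENTION_DAYS = 30
BYTES_PER_POINT = 16      # assumption
COMPRESSION_RATIO = 10    # assumption

total_points = POINTS_PER_SEC * 86_400 * RETENTION_DAYS
raw_bytes = total_points * BYTES_PER_POINT
compressed_tb = raw_bytes / COMPRESSION_RATIO / 1e12

print(total_points)    # 259.2 billion points over 30 days
print(compressed_tb)   # roughly 0.4 TB compressed
```

Even at peak write rate, 30 days of data fits in well under a terabyte compressed, which suggests storage volume is less of a bottleneck than write throughput and query latency.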
Key Components
Metrics ingestion API servers
Buffering and batching layer
Time-series database or storage
Aggregation and query engine
Cache layer for hot queries
Monitoring and alerting hooks
Design Patterns
Push vs pull metrics collection
Batching and buffering for ingestion
Time-series data modeling
Data downsampling and retention policies
CQRS (Command Query Responsibility Segregation) for ingestion and querying
Sharding and partitioning for scale
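The batching-and-buffering pattern above can be sketched as a flush-on-size-or-age policy. This is a minimal in-process sketch; in production the same role is typically played by a Kafka producer's batch size and linger settings, and `flush` here simply returns the batch in place of a queue write:

```python
import time

class MetricsBatcher:
    """Buffers metric points and flushes when the batch is full
    or has aged past a deadline. Illustrative sketch only."""

    def __init__(self, max_size=500, max_age_s=1.0):
        self.max_size = max_size
        self.max_age_s = max_age_s
        self._batch = []
        self._first_at = None

    def add(self, point):
        """Buffer one point; return a full batch when a flush triggers."""
        if self._first_at is None:
            self._first_at = time.monotonic()
        self._batch.append(point)
        too_big = len(self._batch) >= self.max_size
        too_old = time.monotonic() - self._first_at >= self.max_age_s
        return self.flush() if (too_big or too_old) else None

    def flush(self):
        """Return the current batch and reset the buffer."""
        batch, self._batch, self._first_at = self._batch, [], None
        return batch
```

The size bound caps memory and write amplification; the age bound caps how stale buffered data can get, which is the data-freshness trade-off mentioned in the interview tips.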
Reference Architecture
  +----------------+      +----------------+      +------------------+      +------------------+
  | Application(s) | ---> | Ingestion API  | ---> | Buffering Layer  | ---> | Time-Series DB   |
  +----------------+      +----------------+      +------------------+      +------------------+
                                                                                      |
                                                                                      v
                                                                               +------------------+
                                                                               | Query API Server |
                                                                               +------------------+
                                                                                      |
                                                                                      v
                                                                               +------------------+
                                                                               | Cache Layer      |
                                                                               +------------------+
Components
Ingestion API
REST/gRPC servers
Receive metrics data from applications and validate input
Buffering Layer
Message queue (e.g., Kafka)
Buffer and batch incoming metrics for efficient processing
Time-Series Database
TSDB like Prometheus, InfluxDB, or TimescaleDB
Store metrics data with time-based indexing and support aggregation queries
Query API Server
REST/gRPC servers
Serve aggregated metrics queries to clients
Cache Layer
In-memory cache like Redis
Cache frequent query results to reduce latency
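The downsampling pattern listed under Design Patterns can be sketched as bucketing raw points into fixed windows and keeping one aggregate per window; the function and its mean-only rollup are illustrative, since a real TSDB would also keep min/max/count:

```python
from collections import defaultdict

def downsample(points, window_s=60):
    """Average raw (timestamp, value) points into fixed windows.

    Returns {window_start: mean_value}. Older data can be kept
    only in downsampled form to satisfy the retention budget.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_s].append(value)
    return {start: sum(vs) / len(vs) for start, vs in buckets.items()}
```

For example, per-second points older than a day can be rolled up to one-minute averages, cutting stored volume ~60x for that age band.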
Request Flow
1. Applications push metrics data to the Ingestion API.
2. Ingestion API validates and forwards data to the Buffering Layer.
3. Buffering Layer batches data and writes to the Time-Series Database.
4. Clients query metrics via the Query API Server.
5. Query API checks Cache Layer for results; if missing, queries Time-Series Database.
6. Query results are returned to clients and cached for future requests.
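Steps 5 and 6 of the flow are the classic cache-aside pattern. The sketch below stands in a dict with a TTL for Redis, and `query_tsdb` is a hypothetical stand-in for the real database call:

```python
import time

CACHE = {}          # key -> (expires_at, result); Redis in production
CACHE_TTL_S = 30    # short TTL trades freshness for reduced DB load

def query_metrics(key, query_tsdb):
    """Cache-aside read: serve from cache on a hit, otherwise
    query the TSDB and populate the cache for later requests."""
    entry = CACHE.get(key)
    if entry and entry[0] > time.monotonic():
        return entry[1]                       # cache hit
    result = query_tsdb(key)                  # cache miss: hit the TSDB
    CACHE[key] = (time.monotonic() + CACHE_TTL_S, result)
    return result
```

The TTL is the tuning knob for the freshness-versus-load trade-off: a longer TTL absorbs more read traffic but serves staler aggregates.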
Database Schema
Entities:
Metric: id, name, type (counter, gauge, histogram)
MetricData: metric_id, timestamp, value, labels (key-value pairs)
Relationships:
MetricData references Metric by metric_id
Labels stored as JSON or in a separate key-value table for filtering
Time-series data indexed by (metric_id, timestamp) for efficient range queries
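The entities above can be rendered as follows; the dataclasses are an illustrative model of the schema, not any particular TSDB's API:

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    id: int
    name: str      # e.g. "http_requests_total"
    type: str      # "counter" | "gauge" | "histogram"

@dataclass
class MetricData:
    metric_id: int                 # references Metric.id
    timestamp: int                 # epoch seconds; indexed with metric_id
    value: float
    labels: dict = field(default_factory=dict)  # e.g. {"host": "web-1"}
```

Keeping metric metadata separate from the high-volume data rows means each point carries only a small foreign key, and the (metric_id, timestamp) index directly serves range queries over one series.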
Scaling Discussion
Bottlenecks
Ingestion API servers overwhelmed by high write volume
Buffering Layer lag causing delayed writes
Time-Series Database storage and query performance degradation
Query API latency under heavy read load
Cache misses causing increased DB load
Solutions
Scale Ingestion API horizontally behind load balancers
Partition Buffering Layer topics by metric or tenant for parallelism
Use sharded or distributed TSDB clusters with data partitioning
Implement query rate limiting and optimize query plans
Increase cache size and implement cache warming strategies
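The partitioning solutions above rely on a stable shard key so that all points for one series land on one partition and range queries never fan out. The modulo scheme below is a simplification; production systems usually use consistent hashing to limit data movement when partitions are added:

```python
import hashlib

def partition_for(metric_name, num_partitions=16):
    """Stable shard assignment: the same metric name always maps
    to the same partition. Uses md5 rather than Python's hash(),
    which is salted per process and not stable across restarts."""
    digest = hashlib.md5(metric_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

The same key can be used for Kafka topic partitions and TSDB shards so a series follows one path end to end.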
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing components and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Clarify metric types and query patterns early
Explain choice of buffering and storage technologies
Discuss data modeling for time-series data
Highlight caching to reduce query latency
Address scaling challenges with partitioning and horizontal scaling
Mention trade-offs between data freshness and system load