
Metrics collection in HLD - System Design Exercise

Design: Metrics Collection System
This design covers metrics ingestion, storage, querying, and aggregation. Out of scope: a detailed alerting system and visualization dashboards.
Functional Requirements
FR1: Collect metrics data from multiple application instances in real-time
FR2: Support different types of metrics: counters, gauges, histograms
FR3: Allow querying aggregated metrics for monitoring and alerting
FR4: Store metrics data efficiently for at least 30 days
FR5: Provide APIs for metrics ingestion and querying
FR6: Ensure minimal impact on application performance during metrics collection
Non-Functional Requirements
NFR1: Handle up to 100,000 metrics data points per second
NFR2: API response latency for queries should be under 200ms (p99)
NFR3: System availability should be at least 99.9%
NFR4: Data retention for 30 days with efficient storage
NFR5: Support horizontal scaling for ingestion and querying
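A quick back-of-envelope check of NFR1 and NFR4 helps size the storage tier. The 16-bytes-per-point and 10x compression figures below are illustrative assumptions, not part of the requirements:

```python
# Back-of-envelope storage estimate for NFR1 + NFR4.
# Assumed (not from the requirements): ~16 bytes per raw point
# (8-byte timestamp + 8-byte float value, labels amortized out)
# and ~10x compression, typical for time-series encodings.
POINTS_PER_SEC = 100_000
RETENTION_DAYS = 30
BYTES_PER_POINT = 16      # assumption
COMPRESSION_RATIO = 10    # assumption

total_points = POINTS_PER_SEC * 86_400 * RETENTION_DAYS
raw_bytes = total_points * BYTES_PER_POINT
compressed_tb = raw_bytes / COMPRESSION_RATIO / 1e12

print(total_points)    # 259.2 billion points over 30 days
print(compressed_tb)   # roughly 0.4 TB compressed
```

Even at peak write rate, 30 days of data fits in well under a terabyte compressed, which suggests storage volume is less of a bottleneck than write throughput and query latency.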
Key Components
Metrics ingestion API servers
Buffering and batching layer
Time-series database or storage
Aggregation and query engine
Cache layer for hot queries
Monitoring and alerting hooks
Design Patterns
Push vs pull metrics collection
Batching and buffering for ingestion
Time-series data modeling
Data downsampling and retention policies
CQRS (Command Query Responsibility Segregation) for ingestion and querying
Sharding and partitioning for scale
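The batching-and-buffering pattern above can be sketched as a flush-on-size-or-age policy. This is a minimal in-process sketch; in production the same role is typically played by a Kafka producer's batch size and linger settings, and `flush` here simply returns the batch in place of a queue write:

```python
import time

class MetricsBatcher:
    """Buffers metric points and flushes when the batch is full
    or has aged past a deadline. Illustrative sketch only."""

    def __init__(self, max_size=500, max_age_s=1.0):
        self.max_size = max_size
        self.max_age_s = max_age_s
        self._batch = []
        self._first_at = None

    def add(self, point):
        """Buffer one point; return a full batch when a flush triggers."""
        if self._first_at is None:
            self._first_at = time.monotonic()
        self._batch.append(point)
        too_big = len(self._batch) >= self.max_size
        too_old = time.monotonic() - self._first_at >= self.max_age_s
        return self.flush() if (too_big or too_old) else None

    def flush(self):
        """Return the current batch and reset the buffer."""
        batch, self._batch, self._first_at = self._batch, [], None
        return batch
```

The size bound caps memory and write amplification; the age bound caps how stale buffered data can get, which is the data-freshness trade-off mentioned in the interview tips.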
Reference Architecture
  +----------------+      +----------------+      +------------------+      +------------------+
  | Application(s) | ---> | Ingestion API  | ---> | Buffering Layer  | ---> | Time-Series DB   |
  +----------------+      +----------------+      +------------------+      +------------------+
                                                                                      |
                                                                                      v
                                                                               +------------------+
                                                                               | Query API Server |
                                                                               +------------------+
                                                                                      |
                                                                                      v
                                                                               +------------------+
                                                                               | Cache Layer      |
                                                                               +------------------+
Components
Ingestion API
REST/gRPC servers
Receive metrics data from applications and validate input
Buffering Layer
Message queue (e.g., Kafka)
Buffer and batch incoming metrics for efficient processing
Time-Series Database
TSDB like Prometheus, InfluxDB, or TimescaleDB
Store metrics data with time-based indexing and support aggregation queries
Query API Server
REST/gRPC servers
Serve aggregated metrics queries to clients
Cache Layer
In-memory cache like Redis
Cache frequent query results to reduce latency
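The downsampling pattern listed under Design Patterns can be sketched as bucketing raw points into fixed windows and keeping one aggregate per window; the function and its mean-only rollup are illustrative, since a real TSDB would also keep min/max/count:

```python
from collections import defaultdict

def downsample(points, window_s=60):
    """Average raw (timestamp, value) points into fixed windows.

    Returns {window_start: mean_value}. Older data can be kept
    only in downsampled form to satisfy the retention budget.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_s].append(value)
    return {start: sum(vs) / len(vs) for start, vs in buckets.items()}
```

For example, per-second points older than a day can be rolled up to one-minute averages, cutting stored volume ~60x for that age band.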
Request Flow
1. Applications push metrics data to the Ingestion API.
2. Ingestion API validates and forwards data to the Buffering Layer.
3. Buffering Layer batches data and writes to the Time-Series Database.
4. Clients query metrics via the Query API Server.
5. Query API checks Cache Layer for results; if missing, queries Time-Series Database.
6. Query results are returned to clients and cached for future requests.
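Steps 5 and 6 of the flow are the classic cache-aside pattern. The sketch below stands in a dict with a TTL for Redis, and `query_tsdb` is a hypothetical stand-in for the real database call:

```python
import time

CACHE = {}          # key -> (expires_at, result); Redis in production
CACHE_TTL_S = 30    # short TTL trades freshness for reduced DB load

def query_metrics(key, query_tsdb):
    """Cache-aside read: serve from cache on a hit, otherwise
    query the TSDB and populate the cache for later requests."""
    entry = CACHE.get(key)
    if entry and entry[0] > time.monotonic():
        return entry[1]                       # cache hit
    result = query_tsdb(key)                  # cache miss: hit the TSDB
    CACHE[key] = (time.monotonic() + CACHE_TTL_S, result)
    return result
```

The TTL is the tuning knob for the freshness-versus-load trade-off: a longer TTL absorbs more read traffic but serves staler aggregates.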
Database Schema
Entities:
Metric: id, name, type (counter, gauge, histogram)
MetricData: metric_id, timestamp, value, labels (key-value pairs)
Relationships:
MetricData references Metric by metric_id
Labels stored as JSON or in a separate key-value table for filtering
Time-series data indexed by (metric_id, timestamp) for efficient range queries
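The entities above can be rendered as follows; the dataclasses are an illustrative model of the schema, not any particular TSDB's API:

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    id: int
    name: str      # e.g. "http_requests_total"
    type: str      # "counter" | "gauge" | "histogram"

@dataclass
class MetricData:
    metric_id: int                 # references Metric.id
    timestamp: int                 # epoch seconds; indexed with metric_id
    value: float
    labels: dict = field(default_factory=dict)  # e.g. {"host": "web-1"}
```

Keeping metric metadata separate from the high-volume data rows means each point carries only a small foreign key, and the (metric_id, timestamp) index directly serves range queries over one series.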
Scaling Discussion
Bottlenecks
Ingestion API servers overwhelmed by high write volume
Buffering Layer lag causing delayed writes
Time-Series Database storage and query performance degradation
Query API latency under heavy read load
Cache misses causing increased DB load
Solutions
Scale Ingestion API horizontally behind load balancers
Partition Buffering Layer topics by metric or tenant for parallelism
Use sharded or distributed TSDB clusters with data partitioning
Implement query rate limiting and optimize query plans
Increase cache size and implement cache warming strategies
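The partitioning solutions above rely on a stable shard key so that all points for one series land on one partition and range queries never fan out. The modulo scheme below is a simplification; production systems usually use consistent hashing to limit data movement when partitions are added:

```python
import hashlib

def partition_for(metric_name, num_partitions=16):
    """Stable shard assignment: the same metric name always maps
    to the same partition. Uses md5 rather than Python's hash(),
    which is salted per process and not stable across restarts."""
    digest = hashlib.md5(metric_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

The same key can be used for Kafka topic partitions and TSDB shards so a series follows one path end to end.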
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing components and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Clarify metric types and query patterns early
Explain choice of buffering and storage technologies
Discuss data modeling for time-series data
Highlight caching to reduce query latency
Address scaling challenges with partitioning and horizontal scaling
Mention trade-offs between data freshness and system load