Metrics collection (Prometheus) in Microservices - System Design Exercise

Design: Metrics Collection System with Prometheus
Scope: includes metrics collection, storage, querying, and alerting. Excludes detailed dashboard UI design and long-term archival beyond 15 days.
Functional Requirements
FR1: Collect real-time metrics from multiple microservices
FR2: Support scraping metrics at regular intervals (e.g., every 15 seconds)
FR3: Store metrics data efficiently for querying and alerting
FR4: Provide a dashboard for visualizing metrics
FR5: Support alerting based on defined thresholds
FR6: Handle up to 10,000 metric samples per second in total across 100 microservices
Non-Functional Requirements
NFR1: Scrape latency should be under 5 seconds
NFR2: System availability should be 99.9%
NFR3: Storage retention for metrics data should be configurable (default 15 days)
NFR4: Minimal impact on microservices performance during metrics collection
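The numbers in the requirements can be turned into a rough sizing estimate before any design work. The sketch below assumes FR6 means ingested samples per second, and it assumes about 2 bytes per sample on disk (the upper end of the 1 to 2 bytes per sample that Prometheus documentation cites for its compressed storage). Both assumptions should be confirmed with the interviewer or measured on a real workload.

```python
# Back-of-the-envelope sizing from the stated requirements.
SAMPLES_PER_SECOND = 10_000   # FR6, read here as ingested samples per second
SERVICES = 100                # FR6
SCRAPE_INTERVAL_S = 15        # FR2 example interval
RETENTION_DAYS = 15           # NFR3 default
BYTES_PER_SAMPLE = 2          # assumption: upper end of the commonly cited 1-2 bytes


def active_series(samples_per_second: int, scrape_interval_s: int) -> int:
    """Each series yields one sample per scrape, so series = rate * interval."""
    return samples_per_second * scrape_interval_s


def disk_bytes(samples_per_second: int, retention_days: int, bytes_per_sample: int) -> int:
    """Disk needed = retention in seconds * ingest rate * bytes per sample."""
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


series = active_series(SAMPLES_PER_SECOND, SCRAPE_INTERVAL_S)
disk = disk_bytes(SAMPLES_PER_SECOND, RETENTION_DAYS, BYTES_PER_SAMPLE)

print(f"active series:      {series:,}")               # 150,000
print(f"series per service: {series // SERVICES:,}")   # 1,500
print(f"disk for retention: {disk / 1e9:.2f} GB")      # 25.92 GB
```

Under these assumptions the workload is about 150,000 active series and roughly 26 GB of local disk, which is a useful starting point when deciding whether a single Prometheus server is enough.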
Think Before You Design
Key Components
Prometheus server for scraping and storing metrics
Exporters integrated into microservices to expose metrics
Alertmanager for managing alerts
Grafana or similar for visualization
Service discovery mechanism for dynamic microservice endpoints
Design Patterns
Pull-based metrics collection
Time-series data storage
Service discovery for dynamic targets
Alerting based on threshold rules
Horizontal scaling of Prometheus using federation
Reference Architecture
                    +----------------+
                    |   Grafana UI   |
                    +--------+-------+
                             |
                             v
+----------------+      +----+-----+      +--------------+
| Microservices  | ---> | Prometheus| ---> | Alertmanager |
| (with Exporter)|      |  Server  |      +--------------+
+----------------+      +----+-----+
                             |
                             v
                      +------+-------+
                      | TSDB Storage |
                      +--------------+
Components
Microservices with Exporter
Any microservice framework with Prometheus client libraries
Expose application metrics in Prometheus format at /metrics endpoint
Prometheus Server
Prometheus open-source server
Scrape metrics from microservices, store time-series data, and provide query API
Alertmanager
Prometheus Alertmanager
Manage alerts, group, route, and send notifications
Grafana
Grafana open-source dashboard
Visualize metrics data and create dashboards
TSDB Storage
Prometheus built-in time-series database
Efficiently store scraped metrics data with retention policy
Service Discovery
Kubernetes API, Consul, or static config
Dynamically discover microservice endpoints for scraping
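To make the exporter component more concrete, here is a minimal sketch of what a /metrics response body looks like in the Prometheus text exposition format. It uses only the Python standard library for illustration. A real service would normally use an official Prometheus client library instead, and the helper names (`escape_label_value`, `render_sample`, `render_counter`) and the metric `http_requests_total` are hypothetical.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict, value: float) -> str:
    """Format one sample as a single exposition line."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


def render_counter(name: str, help_text: str, samples: list) -> str:
    """Render a counter family: HELP and TYPE comments, then one line per label set."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    lines += [render_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


# Hypothetical request counter for an example service.
metrics_text = render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    [
        ({"method": "GET", "status": "200"}, 1027.0),
        ({"method": "POST", "status": "500"}, 3.0),
    ],
)
print(metrics_text)
```

Each distinct label combination becomes its own line, and therefore its own time series on the server, which is why label choices matter for cardinality.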
Request Flow
1. Each microservice exposes metrics at its /metrics endpoint using a Prometheus client library.
2. The Prometheus server finds the current set of targets through service discovery and periodically scrapes their /metrics endpoints.
3. Scraped samples are stored in the Prometheus TSDB with timestamps.
4. Users query metrics data via the Prometheus API or Grafana dashboards.
5. Prometheus evaluates alerting rules and sends firing alerts to Alertmanager, which groups, routes, and delivers notifications.
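Threshold alerts usually carry a hold duration so that a brief spike does not page anyone. Prometheus alerting rules express this with a `for` clause, and an alert moves from inactive to pending to firing. The sketch below is a simplified model of that state machine, not Prometheus's actual implementation. The class name `ThresholdRule` and the example error-rate values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    threshold: float                      # alert when value is above this
    for_seconds: float                    # condition must hold this long before firing
    active_since: Optional[float] = None  # time the condition first became true

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation cycle."""
        if value <= self.threshold:
            self.active_since = None      # condition cleared, reset the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Hypothetical error-rate readings, evaluated every 15 seconds.
rule = ThresholdRule(threshold=0.05, for_seconds=60)
readings = [0.01, 0.08, 0.09, 0.02, 0.07, 0.08, 0.09, 0.10, 0.11]
states = [rule.evaluate(value, now=i * 15) for i, value in enumerate(readings)]
print(states)
```

In this run the first breach is cancelled by the dip at the fourth reading, and only the second, sustained breach reaches the firing state after 60 seconds. Only firing alerts would be handed to Alertmanager for grouping and notification.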
Database Schema
Prometheus does not use a traditional relational schema. It stores time series: each series is identified by a metric name plus a set of labels (key-value pairs such as service and instance), and holds a sequence of samples, each a timestamp and a numeric value. Key entities: Metric Name, Labels (e.g., service, instance), Timestamp, Value.
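The data model can be illustrated with a toy in-memory store in which a series key is the metric name plus its label set, and each series holds (timestamp, value) samples. This is only a sketch of the logical model. It says nothing about how the real TSDB lays out chunks or indexes on disk, and the class `InMemoryTSDB` and the label values are hypothetical.

```python
from collections import defaultdict


class InMemoryTSDB:
    """Toy store: one list of (timestamp, value) samples per series."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, name: str, labels: dict, timestamp: float, value: float) -> None:
        # A series is identified by metric name plus the full label set.
        key = (name, frozenset(labels.items()))
        self._series[key].append((timestamp, value))

    def select(self, name: str, **matchers: str) -> dict:
        """Return series with this name whose labels include all given matchers."""
        wanted = set(matchers.items())
        return {
            key: samples
            for key, samples in self._series.items()
            if key[0] == name and wanted <= key[1]
        }


db = InMemoryTSDB()
db.append("http_requests_total", {"service": "orders", "instance": "10.0.0.1:8080"}, 0, 100)
db.append("http_requests_total", {"service": "orders", "instance": "10.0.0.1:8080"}, 15, 130)
db.append("http_requests_total", {"service": "cart", "instance": "10.0.0.2:8080"}, 0, 40)

orders = db.select("http_requests_total", service="orders")
for (name, labels), samples in orders.items():
    print(name, dict(labels), samples)
```

Changing any label value creates a new key and therefore a new series, which is the mechanism behind the cardinality concerns raised in the scaling discussion.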
Scaling Discussion
Bottlenecks
Prometheus server CPU and memory limits when scraping many targets or high metric volume
Storage capacity and write throughput for TSDB
Network bandwidth for scraping metrics
Alertmanager handling large alert volumes
Solutions
Use Prometheus federation to aggregate metrics from multiple Prometheus servers
Shard scraping targets across multiple Prometheus instances
Use remote storage integrations (e.g., Thanos, Cortex) for long-term storage and scaling
Optimize scrape intervals and metric cardinality to reduce load
Scale Alertmanager horizontally and configure alert grouping
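Sharding scrape targets can be sketched as hashing each target address and taking the result modulo the number of Prometheus instances. This is similar in spirit to the `hashmod` relabeling action in Prometheus configuration, but the code below is an illustrative model rather than a reproduction of that feature. The target addresses and function names are hypothetical.

```python
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Map a target address to a shard index with a stable hash."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % num_shards


def assign_targets(targets: list, num_shards: int) -> dict:
    """Group targets by the shard (Prometheus instance) that should scrape them."""
    shards = {index: [] for index in range(num_shards)}
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return shards


# Hypothetical targets: 100 services, one instance each.
targets = [f"service-{n}.internal:8080" for n in range(100)]
assignment = assign_targets(targets, num_shards=3)

for index, members in assignment.items():
    print(f"shard {index}: {len(members)} targets")
```

Because the hash is stable, each instance can compute its own share independently and no target is scraped twice. The trade-off is that changing the shard count moves many targets between instances, and queries that span shards then need federation or a global query layer.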
Interview Tips
Time: For a 45-minute interview, spend about 10 minutes understanding requirements and clarifying scale, 20 minutes designing the architecture and data flow, and 10 minutes discussing scaling and trade-offs. Leave 5 minutes for questions.
Explain pull-based scraping and why Prometheus uses it
Discuss metric types and how they affect storage and querying
Describe service discovery for dynamic microservices
Highlight alerting mechanism and integration with Alertmanager
Address scaling challenges and solutions like federation and remote storage