Metrics collection (Prometheus) in Microservices - System Design Exercise

Design: Metrics Collection System with Prometheus

Scope: this design covers metrics collection, storage, querying, and alerting. It excludes detailed dashboard UI design and long-term archival beyond 15 days.

Functional Requirements

FR1: Collect real-time metrics from multiple microservices

FR2: Support scraping metrics at regular intervals (e.g., every 15 seconds)

FR3: Store metrics data efficiently for querying and alerting

FR4: Provide a dashboard for visualizing metrics

FR5: Support alerting based on defined thresholds

FR6: Handle up to 10,000 metric samples per second across 100 microservices

Non-Functional Requirements

NFR1: Scrape latency should be under 5 seconds

NFR2: System availability should be 99.9%

NFR3: Storage retention for metrics data should be configurable (default 15 days)

NFR4: Minimal impact on microservices performance during metrics collection
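
A rough sizing of FR6 and NFR3 can be sketched as below. It reads FR6 as 10,000 samples per second and uses the capacity formula from the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample). The 2 bytes per sample figure is an assumption at the upper end of the commonly quoted 1 to 2 bytes, and the function names are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def estimate_disk_bytes(samples_per_second: float,
                        retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Disk needed = retention seconds * samples/s * bytes/sample."""
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * samples_per_second * bytes_per_sample


def estimate_active_series(samples_per_second: float,
                           scrape_interval_seconds: float) -> float:
    """Each active series contributes one sample per scrape interval."""
    return samples_per_second * scrape_interval_seconds


if __name__ == "__main__":
    disk = estimate_disk_bytes(samples_per_second=10_000, retention_days=15)
    series = estimate_active_series(10_000, scrape_interval_seconds=15)
    print(f"disk: {disk / 1e9:.2f} GB")            # disk: 25.92 GB
    print(f"active series: {series:,.0f}")         # active series: 150,000
    print(f"per service: {series / 100:,.0f}")     # per service: 1,500
```

Under these assumptions the stated load fits comfortably on a single server, which suggests the scaling measures discussed later matter mainly for growth beyond the stated requirements.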

Think Before You Design

Key Components

Prometheus server for scraping and storing metrics

Exporters integrated into microservices to expose metrics

Alertmanager for managing alerts

Grafana or similar for visualization

Service discovery mechanism for dynamic microservice endpoints

Design Patterns

Pull-based metrics collection

Time-series data storage

Service discovery for dynamic targets

Alerting based on threshold rules

Horizontal scaling of Prometheus using federation
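
The pull model combined with service discovery can be sketched as a single scrape cycle. This is a simplified simulation, not how the Prometheus server is implemented: targets are modeled as plain callables instead of HTTP endpoints, and `scrape_once`, `discover`, and the store layout are hypothetical names. Recording a synthetic `up` series of 1 or 0 per target mirrors what Prometheus does for successful and failed scrapes.

```python
from typing import Callable, Dict, List, Tuple

# A target is modeled as a callable returning {metric_name: value};
# in a real deployment this would be an HTTP GET of the /metrics endpoint.
Target = Callable[[], Dict[str, float]]
Sample = Tuple[float, float]  # (timestamp, value)
Store = Dict[Tuple[str, str], List[Sample]]  # keyed by (metric, instance)


def scrape_once(discover: Callable[[], Dict[str, Target]],
                store: Store,
                now: float) -> int:
    """Run one scrape cycle: re-discover targets, pull, append samples."""
    scraped = 0
    for instance, fetch in discover().items():
        try:
            metrics = fetch()
        except Exception:
            # A failed scrape is recorded as up=0 instead of aborting the cycle.
            store.setdefault(("up", instance), []).append((now, 0.0))
            continue
        store.setdefault(("up", instance), []).append((now, 1.0))
        for name, value in metrics.items():
            store.setdefault((name, instance), []).append((now, value))
            scraped += 1
    return scraped


if __name__ == "__main__":
    def broken() -> Dict[str, float]:
        raise ConnectionError("target down")

    targets = {
        "orders:8080": lambda: {"http_requests_total": 42.0},
        "cart:8080": broken,
    }
    store: Store = {}
    print(scrape_once(lambda: targets, store, now=0.0))
    print(store[("up", "cart:8080")])
```

Because the collector calls `discover()` on every cycle, targets that appear or disappear are picked up without any change on the service side, which is the main operational argument for pulling.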

Reference Architecture

                        +----------------+
                        |   Grafana UI   |
                        +--------+-------+
                                 |
                                 v
+-----------------+      +-------+------+      +--------------+
| Microservices   | ---> |  Prometheus  | ---> | Alertmanager |
| (with Exporter) |      |    Server    |      +--------------+
+-----------------+      +-------+------+
                                 |
                                 v
                         +-------+------+
                         | TSDB Storage |
                         +--------------+

Components

Microservices with Exporter

Technology: Any microservice framework with Prometheus client libraries

Purpose: Expose application metrics in Prometheus format at the /metrics endpoint

Prometheus Server

Technology: Prometheus open-source server

Purpose: Scrape metrics from microservices, store time-series data, and provide a query API

Alertmanager

Technology: Prometheus Alertmanager

Purpose: Group, route, and send alert notifications

Grafana

Technology: Grafana open-source dashboard

Purpose: Visualize metrics data and create dashboards

TSDB Storage

Technology: Prometheus built-in time-series database

Purpose: Efficiently store scraped metrics data with a retention policy

Service Discovery

Technology: Kubernetes API, Consul, or static config

Purpose: Dynamically discover microservice endpoints for scraping
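
To make the /metrics contract concrete, the sketch below renders one counter family in the Prometheus text exposition format. It is hand-rolled for illustration only; a real service would normally rely on an official client library for this. The function `render_counter` and the sample metric are hypothetical.

```python
from typing import Dict, Tuple

# Labels are modeled as an ordered tuple of (name, value) pairs.
LabelSet = Tuple[Tuple[str, str], ...]


def _escape(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str,
                   samples: Dict[LabelSet, float]) -> str:
    """Render one counter family as text exposition lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        if labels:
            body = ",".join(f'{k}="{_escape(v)}"' for k, v in labels)
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        {(("method", "GET"), ("status", "200")): 1027.0},
    ), end="")
    # # HELP http_requests_total Total HTTP requests.
    # # TYPE http_requests_total counter
    # http_requests_total{method="GET",status="200"} 1027.0
```

Each distinct label combination becomes its own line and therefore its own time series, which is why label choice drives storage cost.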

Request Flow

1. Each microservice exposes metrics at its /metrics endpoint using a Prometheus client library.

2. The Prometheus server periodically scrapes the /metrics endpoints of all microservices, locating them through service discovery.

3. Scraped metrics are stored in the Prometheus TSDB with timestamps.

4. Users query metrics data via the Prometheus API or Grafana dashboards.

5. Prometheus evaluates alert rules against the stored metrics and sends firing alerts to Alertmanager, which groups and routes them and sends notifications.
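
The alerting step can be sketched as a threshold rule with a hold duration, similar in spirit to the `for` clause of a Prometheus alerting rule: the condition must stay true for a while before the alert moves from pending to firing. The class `ThresholdRule` and its fields are hypothetical, and real rules are written in PromQL, not Python.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    """Hypothetical rule: fire when value > threshold for hold_seconds."""
    threshold: float
    hold_seconds: float
    _breach_started: Optional[float] = None

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for this evaluation."""
        if value <= self.threshold:
            # Condition cleared: forget any breach in progress.
            self._breach_started = None
            return "inactive"
        if self._breach_started is None:
            self._breach_started = now
        if now - self._breach_started >= self.hold_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    rule = ThresholdRule(threshold=0.05, hold_seconds=60)
    for ts, error_ratio in [(0, 0.10), (30, 0.12), (60, 0.11), (75, 0.01)]:
        print(ts, rule.evaluate(error_ratio, now=ts))
    # 0 pending
    # 30 pending
    # 60 firing
    # 75 inactive
```

The hold duration trades detection speed for fewer notifications on short spikes; only alerts that reach the firing state would be handed to Alertmanager.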

Database Schema

Prometheus does not use a traditional relational schema. Its time-series database identifies each series by a metric name plus a set of labels (key-value pairs such as service and instance), and stores that series as a sequence of samples, each consisting of a timestamp and a numeric value. Key entities: Metric Name, Labels (e.g., service, instance), Timestamp, Value.
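
The data model described above can be sketched as a small in-memory store. This is only an illustration of series identity (name plus label set) and label matching; it says nothing about how the Prometheus TSDB lays data out on disk, and `InMemoryTSDB` and its methods are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]  # (metric name, labels)
Sample = Tuple[int, float]                          # (timestamp ms, value)


class InMemoryTSDB:
    """Toy store: one sample list per unique (name, label set)."""

    def __init__(self) -> None:
        self._series: Dict[SeriesKey, List[Sample]] = defaultdict(list)

    def append(self, name: str, labels: Dict[str, str],
               ts_ms: int, value: float) -> None:
        key = (name, frozenset(labels.items()))
        self._series[key].append((ts_ms, value))

    def select(self, name: str,
               matchers: Dict[str, str]) -> Dict[SeriesKey, List[Sample]]:
        """Return series with this name whose labels include all matchers."""
        want = set(matchers.items())
        return {key: samples for key, samples in self._series.items()
                if key[0] == name and want <= key[1]}

    def cardinality(self) -> int:
        """Number of distinct series, the main driver of memory use."""
        return len(self._series)


if __name__ == "__main__":
    db = InMemoryTSDB()
    for instance in ("10.0.0.1:8080", "10.0.0.2:8080"):
        db.append("http_requests_total",
                  {"service": "orders", "instance": instance}, 1_000, 1.0)
    print(db.cardinality())                                          # 2
    print(len(db.select("http_requests_total", {"service": "orders"})))  # 2
```

Every new label value creates a new key, so an unbounded label such as a user ID would grow `cardinality()` without limit.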

Scaling Discussion

Bottlenecks

Prometheus server CPU and memory limits when scraping many targets or high metric volume

Storage capacity and write throughput for TSDB

Network bandwidth for scraping metrics

Alertmanager handling large alert volumes

Solutions

Use Prometheus federation to aggregate metrics from multiple Prometheus servers

Shard scraping targets across multiple Prometheus instances

Use remote storage integrations (e.g., Thanos, Cortex) for long-term storage and scaling

Optimize scrape intervals and metric cardinality to reduce load

Scale Alertmanager horizontally and configure alert grouping
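
Sharding scrape targets can be sketched as a stable hash of the target address taken modulo the number of Prometheus instances. Prometheus offers a `hashmod` relabeling action for this purpose; the sketch below only illustrates the idea with SHA-256 and does not reproduce that action's exact hashing. The function names and target addresses are hypothetical.

```python
import hashlib
from typing import Dict, List


def shard_for(target: str, shard_count: int) -> int:
    """Map a target address to a shard index using a stable hash."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def assign(targets: List[str], shard_count: int) -> Dict[int, List[str]]:
    """Group targets by the shard that should scrape them."""
    shards: Dict[int, List[str]] = {i: [] for i in range(shard_count)}
    for target in targets:
        shards[shard_for(target, shard_count)].append(target)
    return shards


if __name__ == "__main__":
    targets = [f"svc-{i}.internal:8080" for i in range(100)]
    for shard, members in assign(targets, shard_count=3).items():
        print(shard, len(members))
```

Each instance scrapes only its own shard, so a query spanning all services then needs federation or a remote storage layer to see the combined data. Changing `shard_count` reassigns most targets, which is worth raising as a trade-off.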

Interview Tips

Time: Spend 10 minutes understanding requirements and clarifying scale. Use 20 minutes to design architecture and data flow. Reserve 10 minutes to discuss scaling and trade-offs. Leave 5 minutes for questions.

Explain pull-based scraping and why Prometheus uses it

Discuss metric types and how they affect storage and querying

Describe service discovery for dynamic microservices

Highlight alerting mechanism and integration with Alertmanager

Address scaling challenges and solutions like federation and remote storage
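
When discussing metric types, it helps to show why counters are queried through a rate and not read directly. The sketch below computes a per-second rate from raw counter samples and treats a drop in value as a counter reset, which is the same basic idea PromQL's `rate()` applies. It is a simplification: unlike `rate()`, it does not extrapolate to the window boundaries, and `simple_rate` is a hypothetical name.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter across the given samples.

    A value lower than its predecessor is treated as a restart from zero,
    so the new value itself counts as the increase for that step.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed


if __name__ == "__main__":
    print(simple_rate([(0, 0), (60, 120)]))  # 2.0
    # The service restarted between t=15 and t=30, resetting the counter.
    window = [(0, 100), (15, 130), (30, 10), (45, 40)]
    print(round(simple_rate(window), 3))     # 1.556
```

A gauge, by contrast, can move in either direction, so a drop is a real change and no reset handling applies.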

Practice

1. What is the main purpose of Prometheus in a microservices environment? (easy)

A. To collect and store metrics from services for monitoring

B. To deploy microservices automatically

C. To manage user authentication

D. To serve web pages to users

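Solution

Step 1: Understand Prometheus role. Prometheus scrapes metrics from services and stores them as time-series data.

Step 2: Identify monitoring purpose. The stored metrics support querying, dashboards, and alerting; Prometheus does not deploy services, authenticate users, or serve web pages.

Final Answer: A. To collect and store metrics from services for monitoring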
