Metrics collection (Prometheus) in Microservices - System Design Exercise

Design: Metrics Collection System with Prometheus
Includes metrics collection, storage, querying, and alerting. Excludes detailed dashboard UI design and long-term archival beyond 15 days.
Functional Requirements
FR1: Collect real-time metrics from multiple microservices
FR2: Support scraping metrics at regular intervals (e.g., every 15 seconds)
FR3: Store metrics data efficiently for querying and alerting
FR4: Provide a dashboard for visualizing metrics
FR5: Support alerting based on defined thresholds
FR6: Handle up to 10,000 metric samples per second in total across 100 microservices
Non-Functional Requirements
NFR1: Scrape latency should be under 5 seconds
NFR2: System availability should be 99.9%
NFR3: Storage retention for metrics data should be configurable (default 15 days)
NFR4: Minimal impact on microservices performance during metrics collection
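The requirements above imply some rough sizing numbers. The sketch below is a back-of-envelope estimate only: it reads FR6 as 10,000 samples per second in total, and the 2 bytes per sample figure is an assumption based on commonly cited averages for Prometheus's compressed storage, not a measurement.

```python
# Back-of-envelope sizing from FR2, FR6, and NFR3.
SAMPLES_PER_SECOND = 10_000   # FR6, read as total samples/s across all services
SERVICES = 100                # FR6
SCRAPE_INTERVAL_S = 15        # FR2
RETENTION_DAYS = 15           # NFR3 default
BYTES_PER_SAMPLE = 2          # assumption: rough average after compression


def active_series(samples_per_second: int, scrape_interval_s: int) -> int:
    """Each series yields one sample per scrape, so series = rate * interval."""
    return samples_per_second * scrape_interval_s


def retention_bytes(samples_per_second: int, retention_days: int,
                    bytes_per_sample: int) -> int:
    """Disk needed = retention time (s) * ingest rate (samples/s) * bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return samples_per_second * retention_seconds * bytes_per_sample


total_series = active_series(SAMPLES_PER_SECOND, SCRAPE_INTERVAL_S)
series_per_service = total_series // SERVICES
disk_gb = retention_bytes(SAMPLES_PER_SECOND, RETENTION_DAYS, BYTES_PER_SAMPLE) / 1e9

print(total_series, series_per_service, round(disk_gb, 2))
```

Under these assumptions the workload is about 150,000 active series (1,500 per service) and roughly 26 GB of disk for 15 days, which fits comfortably on a single Prometheus server.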
Think Before You Design
Key Components
Prometheus server for scraping and storing metrics
Exporters integrated into microservices to expose metrics
Alertmanager for managing alerts
Grafana or similar for visualization
Service discovery mechanism for dynamic microservice endpoints
Design Patterns
Pull-based metrics collection
Time-series data storage
Service discovery for dynamic targets
Alerting based on threshold rules
Horizontal scaling of Prometheus using federation
Reference Architecture
                         +--------------+
                         |  Grafana UI  |
                         +------+-------+
                                |
                                v
+-----------------+      +------+-------+      +--------------+
| Microservices   | ---> |  Prometheus  | ---> | Alertmanager |
| (with Exporter) |      |    Server    |      +--------------+
+-----------------+      +------+-------+
                                |
                                v
                         +------+-------+
                         | TSDB Storage |
                         +--------------+
Components
Microservices with Exporter
  Technology: Any microservice framework with Prometheus client libraries
  Responsibility: Expose application metrics in Prometheus format at the /metrics endpoint
Prometheus Server
  Technology: Prometheus open-source server
  Responsibility: Scrape metrics from microservices, store time-series data, and provide a query API
Alertmanager
  Technology: Prometheus Alertmanager
  Responsibility: Group, route, and send alert notifications
Grafana
  Technology: Grafana open-source dashboard
  Responsibility: Visualize metrics data and create dashboards
TSDB Storage
  Technology: Prometheus built-in time-series database
  Responsibility: Efficiently store scraped metrics data with a retention policy
Service Discovery
  Technology: Kubernetes API, Consul, or static config
  Responsibility: Dynamically discover microservice endpoints for scraping
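To make the "expose metrics at /metrics" responsibility concrete, here is a minimal standard-library sketch of a service rendering a counter in the Prometheus text exposition format. It is illustrative only: a real service would normally use a Prometheus client library, and the names REQUEST_COUNTS, render_metrics, MetricsHandler, and serve are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters keyed by (method, status).
REQUEST_COUNTS = {("GET", "200"): 0, ("GET", "500"): 0}


def render_metrics(counters: dict) -> str:
    """Render counters as Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counters.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the /metrics path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Block and serve /metrics on localhost (port is an arbitrary choice)."""
    HTTPServer(("localhost", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; a Prometheus scrape job with the target localhost:8080 would then collect http_requests_total with its method and status labels.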
Request Flow
1. Each microservice exposes metrics at its /metrics endpoint using a Prometheus client library.
2. The Prometheus server finds targets through service discovery and periodically scrapes each /metrics endpoint.
3. Scraped samples are stored in the Prometheus TSDB with timestamps.
4. Users query metrics through the Prometheus HTTP API or Grafana dashboards.
5. Prometheus evaluates alerting rules against stored data and sends firing alerts to Alertmanager, which groups, routes, and delivers notifications.
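The alerting step can be sketched as a small state machine. In Prometheus, rule evaluation happens on the Prometheus server, and only firing alerts are handed to Alertmanager. The class below is a simplified illustration of a threshold rule with a hold duration (similar in spirit to the "for" clause of an alerting rule); ThresholdRule and its fields are hypothetical names, not a Prometheus API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    name: str
    threshold: float
    for_seconds: float                    # how long the condition must hold before firing
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation cycle."""
        if value <= self.threshold:
            self.active_since = None      # condition cleared, reset the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"               # this is the point an alert would be sent on
        return "pending"


rule = ThresholdRule(name="HighErrorRate", threshold=5.0, for_seconds=60)
states = [rule.evaluate(v, t) for t, v in [(0, 7.0), (30, 8.0), (60, 9.0), (75, 2.0)]]
print(states)
```

The hold duration is a design choice that trades detection speed for fewer notifications on short spikes.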
Database Schema
Prometheus uses a time-series database schema where each metric is stored as a time-stamped data point with labels (key-value pairs) identifying the metric source and type. No traditional relational schema is used. Key entities: Metric Name, Labels (e.g., service, instance), Timestamp, Value.
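As a rough mental model of this schema, a series is identified by its metric name plus label set, and holds an ordered list of (timestamp, value) samples. The in-memory sketch below illustrates that model only; it says nothing about how the real TSDB lays out blocks or compresses data, and TinySeriesStore is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

# A series is identified by metric name + label set.
SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]
Sample = Tuple[float, float]  # (timestamp, value)


class TinySeriesStore:
    def __init__(self) -> None:
        self._series: Dict[SeriesKey, List[Sample]] = defaultdict(list)

    def append(self, name: str, labels: Dict[str, str],
               timestamp: float, value: float) -> None:
        """Add one sample to the series identified by name and labels."""
        key = (name, frozenset(labels.items()))
        self._series[key].append((timestamp, value))

    def select(self, name: str, **matchers: str) -> Dict[SeriesKey, List[Sample]]:
        """Return series with this name whose labels include every matcher."""
        wanted = set(matchers.items())
        return {
            key: samples
            for key, samples in self._series.items()
            if key[0] == name and wanted <= key[1]
        }


store = TinySeriesStore()
store.append("http_requests_total", {"service": "cart", "instance": "a"}, 0.0, 10)
store.append("http_requests_total", {"service": "cart", "instance": "a"}, 15.0, 25)
store.append("http_requests_total", {"service": "auth", "instance": "b"}, 0.0, 4)
cart = store.select("http_requests_total", service="cart")
```

Because every distinct label combination creates a new series, label values with many possible values (user IDs, request IDs) multiply the number of series, which is why cardinality matters for storage.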
Scaling Discussion
Bottlenecks
Prometheus server CPU and memory limits when scraping many targets or high metric volume
Storage capacity and write throughput for TSDB
Network bandwidth for scraping metrics
Alertmanager handling large alert volumes
Solutions
Use Prometheus federation to aggregate metrics from multiple Prometheus servers
Shard scraping targets across multiple Prometheus instances
Use remote storage integrations (e.g., Thanos, Cortex) for long-term storage and scaling
Optimize scrape intervals and metric cardinality to reduce load
Scale Alertmanager horizontally and configure alert grouping
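Sharding scrape targets can be illustrated with a stable hash: each Prometheus instance keeps only the targets whose hash falls into its bucket. The sketch below is similar in spirit to Prometheus's hashmod relabel action, but it is not claimed to reproduce its exact hashing; shard_for and assign_targets are hypothetical helpers.

```python
import hashlib
from typing import Dict, List


def shard_for(target: str, shard_count: int) -> int:
    """Map a target address to a shard index using a stable hash."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % shard_count


def assign_targets(targets: List[str], shard_count: int) -> Dict[int, List[str]]:
    """Group targets by shard; every target lands in exactly one shard."""
    shards: Dict[int, List[str]] = {i: [] for i in range(shard_count)}
    for target in targets:
        shards[shard_for(target, shard_count)].append(target)
    return shards


targets = [f"svc-{i}:8080" for i in range(100)]  # made-up target addresses
shards = assign_targets(targets, shard_count=4)
print({shard: len(members) for shard, members in shards.items()})
```

A stable hash keeps assignments consistent across restarts, but changing shard_count reshuffles most targets, so shard counts are usually changed rarely.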
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying scale. Use 20 minutes to design architecture and data flow. Reserve 10 minutes to discuss scaling and trade-offs. Leave 5 minutes for questions.
Explain pull-based scraping and why Prometheus uses it
Discuss metric types and how they affect storage and querying
Describe service discovery for dynamic microservices
Highlight alerting mechanism and integration with Alertmanager
Address scaling challenges and solutions like federation and remote storage

Practice

1. What is the main purpose of Prometheus in a microservices environment?
easy
A. To collect and store metrics from services for monitoring
B. To deploy microservices automatically
C. To manage user authentication
D. To serve web pages to users

Solution

  1. Step 1: Understand Prometheus role

    Prometheus is designed to collect numerical data called metrics from running services.
  2. Step 2: Identify monitoring purpose

    These metrics help monitor service health and performance in microservices.
  3. Final Answer:

    To collect and store metrics from services for monitoring -> Option A
  4. Quick Check:

    Prometheus = Metrics collection [OK]
Hint: Prometheus is for metrics, not deployment or auth [OK]
Common Mistakes:
  • Confusing Prometheus with deployment tools
  • Thinking Prometheus manages users
  • Assuming Prometheus serves web content
2. Which YAML configuration snippet correctly defines a Prometheus scrape job for a service at http://localhost:8080/metrics?
easy
A. jobs: - job: 'myservice' endpoints: ['localhost:8080']
B. scrape_configs: - job_name: 'myservice' static_configs: - targets: ['http://localhost:8080/metrics']
C. scrape_configs: - job_name: 'myservice' static_configs: - targets: ['localhost:8080']
D. scrape_jobs: - name: 'myservice' targets: ['localhost:8080/metrics']

Solution

  1. Step 1: Check Prometheus YAML syntax

    Prometheus uses scrape_configs with job_name and static_configs listing targets as host:port without URL path.
  2. Step 2: Validate target format

    Targets must be host:port only, no http:// or path like /metrics.
  3. Final Answer:

    scrape_configs: - job_name: 'myservice' static_configs: - targets: ['localhost:8080'] -> Option C
  4. Quick Check:

    Targets = host:port only [OK]
Hint: Targets list host:port only, no URL scheme or path [OK]
Common Mistakes:
  • Including http:// or /metrics in targets
  • Using wrong YAML keys like scrape_jobs or jobs
  • Misnaming job_name or static_configs
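The reason targets hold only host:port is that Prometheus assembles the scrape URL from separate settings: the scheme (default http), the target, and metrics_path (default /metrics). The function below sketches that composition; the validation is a deliberate simplification, not Prometheus's actual check, and scrape_url is a hypothetical helper.

```python
def scrape_url(target: str, scheme: str = "http",
               metrics_path: str = "/metrics") -> str:
    """Build the URL a scrape would hit from its separate config pieces."""
    # Simplified check: a target must be host:port, with no scheme or path.
    if "://" in target or "/" in target:
        raise ValueError(f"target must be host:port, got {target!r}")
    return f"{scheme}://{target}{metrics_path}"


print(scrape_url("localhost:8080"))
print(scrape_url("localhost:8080", metrics_path="/actuator/prometheus"))
```

Keeping the path out of targets means one job-level metrics_path setting applies to every discovered instance of a service.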
3. Given this Prometheus query: rate(http_requests_total[5m]), what does it calculate?
medium
A. The average rate of HTTP requests per second over the last 5 minutes
B. The current number of active HTTP requests
C. The total number of HTTP requests since service start
D. The maximum number of HTTP requests in the last 5 minutes

Solution

  1. Step 1: Understand rate() function

    The rate() function calculates the per-second average increase of a counter over a time window.
  2. Step 2: Apply to http_requests_total[5m]

    This means it measures how fast the total HTTP requests counter increased in the last 5 minutes, giving requests per second.
  3. Final Answer:

    The average rate of HTTP requests per second over the last 5 minutes -> Option A
  4. Quick Check:

    rate() = per-second average increase [OK]
Hint: rate() gives per-second average over time window [OK]
Common Mistakes:
  • Thinking rate() returns total count
  • Confusing rate() with current active requests
  • Assuming rate() returns max value
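What rate() does with a counter can be approximated in a few lines. This is a simplified sketch: it accounts for counter resets but skips the extrapolation to window boundaries that the real PromQL function performs, so its results will differ slightly from Prometheus. simple_rate is a hypothetical name.

```python
from typing import List, Optional, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Per-second average increase of a counter.

    samples: (timestamp_seconds, counter_value) pairs inside the window,
    sorted by time. Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset (e.g. service restart): it started again from zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# One sample every 15 s for 5 minutes, counter growing by 30 per scrape.
window = [(float(t), 2.0 * t) for t in range(0, 301, 15)]
print(simple_rate(window))
```

Because the result is based on the increase rather than the raw value, a service restart that resets the counter does not produce a negative or wildly wrong rate.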
4. You configured Prometheus to scrape localhost:9090 but no metrics appear. Which fix is correct?
medium
A. Change target to localhost:9090/metrics in YAML
B. Remove job_name from config
C. Restart Prometheus to reload config
D. Add metrics_path: '/metrics' under the scrape job

Solution

  1. Step 1: Understand default metrics path

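    Prometheus scrapes the /metrics path by default. If the service exposes metrics on a different path, that path must be set with metrics_path under the scrape job.
  2. Step 2: Set the metrics path in the right place

    Of the options listed, only metrics_path is a valid way to control the scrape path. Set it to the path the service actually serves; a value of '/metrics' only restates the default.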
  3. Final Answer:

    Add metrics_path: '/metrics' under the scrape job -> Option D
  4. Quick Check:

    metrics_path fixes scrape URL [OK]
Hint: Use metrics_path to set correct scrape URL path [OK]
Common Mistakes:
  • Adding path in targets instead of metrics_path
  • Restarting without config fix
  • Removing job_name breaks config
5. You want to monitor error rates in a microservice using Prometheus. The service exposes http_requests_total with labels status and method. Which query shows the error rate (status codes 500-599) over the last 10 minutes as a percentage of all requests?
hard
A. rate(http_requests_total{status=~"5.."}[10m]) / rate(http_requests_total[10m]) * 100
B. sum(rate(http_requests_total{status=~"5.."}[10m])) / sum(rate(http_requests_total[10m])) * 100
C. sum(rate(http_requests_total{status=~"5.."}[10m])) * 100
D. sum(rate(http_requests_total{status!~"5.."}[10m])) / sum(rate(http_requests_total[10m])) * 100

Solution

  1. Step 1: Filter error status codes 500-599

    Use regex status=~"5.." to select error codes in the 500 range.
  2. Step 2: Calculate error rate as percentage

    Sum the rate of error requests and divide by sum of all requests rate, then multiply by 100 for percentage.
  3. Final Answer:

    sum(rate(http_requests_total{status=~"5.."}[10m])) / sum(rate(http_requests_total[10m])) * 100 -> Option B
  4. Quick Check:

    Error rate % = error requests / total requests * 100 [OK]
Hint: Sum rates before division for correct percentage [OK]
Common Mistakes:
  • Dividing single rates instead of sums
  • Using wrong label regex
  • Multiplying before division
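The difference between options A and B can be checked with plain arithmetic. The sketch below uses made-up per-series rates and mimics the two approaches: summing before dividing, versus dividing series pairwise on identical label sets, where each 5xx series on the left can only match itself on the right. The helper names are hypothetical.

```python
import re
from typing import Dict, Tuple

# Made-up per-series request rates (requests/second), keyed by (method, status).
rates: Dict[Tuple[str, str], float] = {
    ("GET", "200"): 90.0,
    ("POST", "200"): 8.0,
    ("GET", "500"): 1.5,
    ("POST", "503"): 0.5,
}


def is_5xx(status: str) -> bool:
    """Same idea as the matcher status=~"5..", which is anchored to the whole value."""
    return re.fullmatch(r"5..", status) is not None


def error_percentage(series: Dict[Tuple[str, str], float]) -> float:
    """Option B style: sum error rates, sum all rates, then divide."""
    errors = sum(rate for (_, status), rate in series.items() if is_5xx(status))
    total = sum(series.values())
    return errors / total * 100


def per_series_ratio(series: Dict[Tuple[str, str], float]) -> Dict[Tuple[str, str], float]:
    """Option A style: each 5xx series is divided by the series with the same labels."""
    return {
        key: rate / series[key] * 100
        for key, rate in series.items()
        if is_5xx(key[1])
    }


print(error_percentage(rates))
print(per_series_ratio(rates))
```

With these numbers the summed version gives 2% (2 of 100 requests per second), while the pairwise version gives 100% for every error series, which is why the sums are needed.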