
Metrics collection (Prometheus) in Microservices - System Design Guide

Problem Statement
Without a centralized and efficient way to collect metrics, monitoring microservices becomes unreliable and slow. This leads to delayed detection of failures, poor understanding of system health, and difficulty in troubleshooting performance issues.
Solution
Prometheus solves this by scraping metrics from each microservice at regular intervals using a pull model. It stores these metrics as time-series data, allowing real-time querying and alerting. Each service exposes an HTTP endpoint with metrics in a standard format, enabling Prometheus to collect and aggregate data efficiently.
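The pull model described above can be sketched without any libraries. This is a minimal in-process simulation, not how Prometheus is implemented: `FakeService`, `parse_metrics`, `scrape`, and `store` are hypothetical names, the parser handles only unlabeled samples, and the 15-second interval is an assumed value. A real service would serve this text over HTTP, and Prometheus would handle scraping and storage.

```python
class FakeService:
    """Hypothetical service that keeps a request counter in memory."""

    def __init__(self):
        self.requests_total = 0

    def handle_request(self):
        self.requests_total += 1

    def metrics_text(self):
        # Text exposition format: HELP and TYPE comment lines, then "name value".
        return (
            "# HELP app_requests_total Total HTTP requests\n"
            "# TYPE app_requests_total counter\n"
            f"app_requests_total {self.requests_total}\n"
        )


def parse_metrics(text):
    """Parse simple 'name value' sample lines, skipping comments (no label support)."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, value = line.split()
        samples[name] = float(value)
    return samples


def scrape(service, store, now):
    """One pull: read the service's metrics and append timestamped samples."""
    for name, value in parse_metrics(service.metrics_text()).items():
        store.setdefault(name, []).append((now, value))


service = FakeService()
store = {}  # metric name -> list of (timestamp, value), i.e. a time series

for tick in range(3):
    service.handle_request()
    service.handle_request()
    scrape(service, store, now=tick * 15)  # assumed 15 s scrape interval

print(store["app_requests_total"])
```

The collector decides when each sample is taken; the service only has to answer with its current counter values.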
Architecture
[Diagram: the Prometheus server scrapes metrics from microservices that expose HTTP endpoints and forwards alerts to Alertmanager.]

Trade-offs
✓ Pros
Pull-based scraping allows Prometheus to control when and how often metrics are collected.
Time-series storage enables efficient querying and historical analysis of metrics.
Standardized metrics format simplifies integration with diverse microservices.
Built-in alerting supports proactive incident response.
✗ Cons
Pull model requires services to expose HTTP endpoints, which may not be feasible in all environments.
High scrape frequency can increase network and CPU load on services.
Scaling Prometheus for very large environments requires federation or sharding, adding complexity.
Use Prometheus when you have multiple microservices that can expose HTTP endpoints and need real-time monitoring with alerting, especially at scales from hundreds to thousands of services.
Avoid Prometheus if your services cannot expose HTTP endpoints or if you have extremely high cardinality metrics that exceed Prometheus's storage and query capabilities.
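To make the cardinality concern concrete, here is a rough back-of-the-envelope sketch. Each distinct combination of label values becomes its own time series, so the upper bound for one metric is the product of the number of values per label. The label names and value counts below are made up for illustration.

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical label sets for a single metric.
bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "400", "404", "500", "503"],
}
# Adding an unbounded label such as a user ID multiplies the series count.
unbounded = dict(bounded, user_id=[str(i) for i in range(10_000)])

print(series_count(bounded))    # 4 * 5
print(series_count(unbounded))  # 4 * 5 * 10_000
```

Labels with a small, fixed set of values keep the series count manageable; labels derived from user IDs, request IDs, or similar unbounded values are the usual cause of cardinality problems.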
Real World Examples
Netflix
Netflix uses Prometheus to monitor microservices performance and availability, enabling rapid detection of streaming issues.
Uber
Uber employs Prometheus to collect metrics from its ride-hailing microservices, supporting real-time alerting and capacity planning.
Spotify
Spotify integrates Prometheus to track service health and user request latencies across its music streaming platform.
Code Example
Before, the service did not expose any metrics. After applying Prometheus metrics collection, the service increments a counter on each request and exposes a /metrics HTTP endpoint that Prometheus can scrape.
### Before: No metrics exposed
from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello():
    return "Hello World"

if __name__ == '__main__':
    app.run()


### After: Exposing Prometheus metrics
from flask import Flask
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

REQUEST_COUNT = Counter('app_requests_total', 'Total HTTP Requests')

@app.route('/')
def hello():
    REQUEST_COUNT.inc()
    return "Hello World"

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

if __name__ == '__main__':
    app.run()
Alternatives
Pushgateway
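Pushgateway is an intermediary for jobs that cannot be scraped directly: the job pushes its metrics to the Pushgateway, and Prometheus then scrapes the Pushgateway like any other target.
Use when: Use Pushgateway for short-lived jobs, such as batch jobs, that may exit before Prometheus can scrape them. The Prometheus documentation recommends it mainly for service-level batch jobs rather than as a general way around firewalls.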
StatsD
StatsD uses a push model where services send metrics to a daemon that aggregates and forwards them.
Use when: Choose StatsD when you prefer a push model or have legacy systems that cannot expose HTTP endpoints.
OpenTelemetry
OpenTelemetry provides a vendor-neutral framework for collecting metrics, traces, and logs, supporting multiple backends including Prometheus.
Use when: Use OpenTelemetry when you want unified observability data collection across metrics, traces, and logs.
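For contrast with the pull model, the push approach used by StatsD can be sketched as follows. The counter line format `<name>:<value>|c` is the StatsD line protocol; the function names, the metric name, and the host and port (8125 is the conventional StatsD default) are assumptions for illustration.

```python
import socket


def statsd_counter(name, value=1):
    """Format a counter increment in the StatsD line protocol: '<name>:<value>|c'."""
    return f"{name}:{value}|c"


def send_counter(name, value=1, host="127.0.0.1", port=8125):
    """Push one counter increment over UDP (fire-and-forget, no reply expected)."""
    payload = statsd_counter(name, value).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


print(statsd_counter("app.requests"))
print(statsd_counter("app.requests", 5))
```

Here the service initiates every transmission, so no HTTP endpoint is needed, but the collector has no control over when or how often data arrives.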
Summary
Prometheus collects metrics by scraping HTTP endpoints exposed by microservices.
It stores metrics as time-series data enabling real-time monitoring and alerting.
Prometheus is widely used in microservices environments for scalable and reliable observability.

Practice

1. What is the main purpose of Prometheus in a microservices environment?
easy
A. To collect and store metrics from services for monitoring
B. To deploy microservices automatically
C. To manage user authentication
D. To serve web pages to users

Solution

  1. Step 1: Understand Prometheus role

    Prometheus is designed to collect numerical data called metrics from running services.
  2. Step 2: Identify monitoring purpose

    These metrics help monitor service health and performance in microservices.
  3. Final Answer:

    To collect and store metrics from services for monitoring -> Option A
  4. Quick Check:

    Prometheus = Metrics collection [OK]
Hint: Prometheus is for metrics, not deployment or auth [OK]
Common Mistakes:
  • Confusing Prometheus with deployment tools
  • Thinking Prometheus manages users
  • Assuming Prometheus serves web content
2. Which YAML configuration snippet correctly defines a Prometheus scrape job for a service at http://localhost:8080/metrics?
easy
A. jobs: - job: 'myservice' endpoints: ['localhost:8080']
B. scrape_configs: - job_name: 'myservice' static_configs: - targets: ['http://localhost:8080/metrics']
C. scrape_configs: - job_name: 'myservice' static_configs: - targets: ['localhost:8080']
D. scrape_jobs: - name: 'myservice' targets: ['localhost:8080/metrics']

Solution

  1. Step 1: Check Prometheus YAML syntax

    Prometheus uses scrape_configs with job_name and static_configs listing targets as host:port without URL path.
  2. Step 2: Validate target format

    Targets must be host:port only, no http:// or path like /metrics.
  3. Final Answer:

    scrape_configs: - job_name: 'myservice' static_configs: - targets: ['localhost:8080'] -> Option C
  4. Quick Check:

    Targets = host:port only [OK]
Hint: Targets list host:port only, no URL scheme or path [OK]
Common Mistakes:
  • Including http:// or /metrics in targets
  • Using wrong YAML keys like scrape_jobs or jobs
  • Misnaming job_name or static_configs
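The host:port rule can be expressed as a small check. This is a simplified sketch for illustration, not Prometheus's own validation logic; `is_valid_target` is a hypothetical helper.

```python
def is_valid_target(target):
    """Rough check that a static_configs target looks like 'host:port'.

    Simplified for illustration; it is not Prometheus's own validation.
    """
    if "://" in target or "/" in target:
        return False  # a scheme or path does not belong in the target string
    host, sep, port = target.rpartition(":")
    return bool(host) and sep == ":" and port.isdigit() and 0 < int(port) < 65536


for t in ["localhost:8080", "http://localhost:8080/metrics", "localhost:8080/metrics"]:
    print(t, is_valid_target(t))
```

Only the first target passes; the other two carry a scheme or a path, which is the mistake made in options B and D.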
3. Given this Prometheus query: rate(http_requests_total[5m]), what does it calculate?
medium
A. The average rate of HTTP requests per second over the last 5 minutes
B. The current number of active HTTP requests
C. The total number of HTTP requests since service start
D. The maximum number of HTTP requests in the last 5 minutes

Solution

  1. Step 1: Understand rate() function

    The rate() function calculates the per-second average increase of a counter over a time window.
  2. Step 2: Apply to http_requests_total[5m]

    This means it measures how fast the total HTTP requests counter increased in the last 5 minutes, giving requests per second.
  3. Final Answer:

    The average rate of HTTP requests per second over the last 5 minutes -> Option A
  4. Quick Check:

    rate() = per-second average increase [OK]
Hint: rate() gives per-second average over time window [OK]
Common Mistakes:
  • Thinking rate() returns total count
  • Confusing rate() with current active requests
  • Assuming rate() returns max value
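The idea behind rate() can be approximated in a few lines. This is a simplified sketch: it computes the per-second average increase between the first and last sample and treats any drop in value as a counter reset, but it omits the extrapolation to the window boundaries that Prometheus performs. The function name and sample data are made up.

```python
def simple_rate(samples):
    """Per-second average increase of a counter over the given samples.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (the counter restarted from 0).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical samples scraped once a minute over a 5-minute window.
window = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 340), (300, 400)]
print(simple_rate(window))  # 300 requests over 300 seconds

# Same idea with a service restart between the second and third sample.
with_reset = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(simple_rate(with_reset))
```

The result is a rate in requests per second, not a total count, which is why option C does not describe rate().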
4. You configured Prometheus to scrape localhost:9090, but the service exposes its metrics at /custom-metrics instead of /metrics, and no metrics appear. Which fix is correct?
medium
A. Change target to localhost:9090/custom-metrics in YAML
B. Remove job_name from config
C. Restart Prometheus without changing the config
D. Add metrics_path: '/custom-metrics' under the scrape job

Solution

  1. Step 1: Understand default metrics path

    Prometheus scrapes the /metrics path by default. If the service exposes metrics on a different path, the scrape fails until that path is specified.
  2. Step 2: Set the metrics path

    Adding metrics_path: '/custom-metrics' to the scrape job tells Prometheus which path to request on each target.
  3. Final Answer:

    Add metrics_path: '/custom-metrics' under the scrape job -> Option D
  4. Quick Check:

    metrics_path fixes scrape URL [OK]
Hint: Use metrics_path to set correct scrape URL path [OK]
Common Mistakes:
  • Adding path in targets instead of metrics_path
  • Restarting without config fix
  • Removing job_name, which breaks the config
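How the target and the metrics path combine into the URL that is actually requested can be sketched as below. `scrape_url` is a hypothetical helper and `/custom-metrics` is a made-up path; the defaults mirror Prometheus's default path of /metrics and default scheme of http.

```python
def scrape_url(target, metrics_path="/metrics", scheme="http"):
    """Combine scheme, host:port target, and metrics path into the scraped URL."""
    return f"{scheme}://{target}{metrics_path}"


print(scrape_url("localhost:9090"))
print(scrape_url("localhost:9090", metrics_path="/custom-metrics"))
```

Because the path is a separate setting, it never belongs in the target string itself.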
5. You want to monitor error rates in a microservice using Prometheus. The service exposes http_requests_total with labels status and method. Which query shows the error rate (status codes 500-599) over the last 10 minutes as a percentage of all requests?
hard
A. rate(http_requests_total{status=~"5.."}[10m]) / rate(http_requests_total[10m]) * 100
B. sum(rate(http_requests_total{status=~"5.."}[10m])) / sum(rate(http_requests_total[10m])) * 100
C. sum(rate(http_requests_total{status=~"5.."}[10m])) * 100
D. sum(rate(http_requests_total{status!~"5.."}[10m])) / sum(rate(http_requests_total[10m])) * 100

Solution

  1. Step 1: Filter error status codes 500-599

    Use regex status=~"5.." to select error codes in the 500 range.
  2. Step 2: Calculate error rate as percentage

    Sum the rate of error requests and divide by sum of all requests rate, then multiply by 100 for percentage.
  3. Final Answer:

    sum(rate(http_requests_total{status=~"5.."}[10m])) / sum(rate(http_requests_total[10m])) * 100 -> Option B
  4. Quick Check:

    Error rate % = error requests / total requests * 100 [OK]
Hint: Sum rates before division for correct percentage [OK]
Common Mistakes:
  • Dividing single rates instead of sums
  • Using wrong label regex
  • Multiplying before division
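The arithmetic behind option B, and the reason option A falls short, can be checked with a small sketch. The per-series rates below are invented values standing in for the result of rate(http_requests_total[10m]); the function names are hypothetical.

```python
import re

# Hypothetical per-series rates in requests/second, keyed by (method, status).
rates = {
    ("GET", "200"): 8.0,
    ("POST", "200"): 1.5,
    ("GET", "500"): 0.4,
    ("POST", "503"): 0.1,
}


def is_error(status):
    """Mirror the label matcher status=~"5.." (PromQL regexes are fully anchored)."""
    return re.fullmatch("5..", status) is not None


def error_percentage(rates):
    """sum(5xx rates) / sum(all rates) * 100, mirroring the query in option B."""
    total = sum(rates.values())
    errors = sum(r for (_, status), r in rates.items() if is_error(status))
    return errors / total * 100


def per_series_percentage(rates):
    """Option A without sum(): each 5xx series is divided by the series with
    identical labels, which is itself, so every result is 100."""
    return {k: r / rates[k] * 100 for k, r in rates.items() if is_error(k[1])}


print(round(error_percentage(rates), 2))
print(per_series_percentage(rates))
```

Summing first removes the labels from both sides, so the division compares all error traffic against all traffic rather than each series against itself.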