Metrics collection (Prometheus) in Microservices - System Design Guide

Problem Statement

Without a centralized and efficient way to collect metrics, monitoring microservices becomes unreliable and slow. This leads to delayed detection of failures, poor understanding of system health, and difficulty in troubleshooting performance issues.

Solution

Prometheus solves this by scraping metrics from each microservice at regular intervals using a pull model. It stores these metrics as time-series data, allowing real-time querying and alerting. Each service exposes an HTTP endpoint with metrics in a standard format, enabling Prometheus to collect and aggregate data efficiently.
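The pull model described above can be sketched in a few lines of standard-library Python. This is only an illustration, not how the Prometheus server is implemented: the names `http_fetch`, `parse_exposition`, and `scrape_once` are hypothetical, the parser handles only a simplified subset of the text format (one `series value` pair per line, no trailing timestamps), and the target URL in the usage note is made up.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable, List, Tuple

# (scrape timestamp, target URL, series name, value)
Sample = Tuple[float, str, str, float]


def http_fetch(url: str) -> str:
    """Fetch a metrics page over HTTP; this is the 'pull' in the pull model."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_exposition(text: str) -> Dict[str, float]:
    """Parse a simplified subset of the text format.

    Assumes each sample line is '<series> <value>' with no trailing timestamp.
    Lines starting with '#' (HELP/TYPE metadata) and blank lines are skipped.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


def scrape_once(
    targets: Iterable[str],
    fetch: Callable[[str], str] = http_fetch,
    now: Callable[[], float] = time.time,
) -> List[Sample]:
    """Pull every target once and return timestamped samples."""
    collected: List[Sample] = []
    for url in targets:
        try:
            body = fetch(url)
        except OSError:
            # Unreachable target: skipped here; a real server would mark it down.
            continue
        scraped_at = now()
        for series, value in parse_exposition(body).items():
            collected.append((scraped_at, url, series, value))
    return collected
```

Calling `scrape_once(["http://localhost:5000/metrics"])` in a loop with a fixed sleep would approximate a scrape interval. Passing `fetch` and `now` as parameters keeps the sketch testable without a network.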

Architecture

Microservices (HTTP metrics endpoints)

↑ scrape

Prometheus Server

↓ alerts

Alertmanager

This diagram shows Prometheus scraping metrics from microservices exposing HTTP endpoints and forwarding alerts to Alertmanager.

Trade-offs

✓ Pros

→ Pull-based scraping allows Prometheus to control when and how often metrics are collected.
→ Time-series storage enables efficient querying and historical analysis of metrics.
→ Standardized metrics format simplifies integration with diverse microservices.
→ Built-in alerting supports proactive incident response.

✗ Cons

→ Pull model requires services to expose HTTP endpoints, which may not be feasible in all environments.
→ High scrape frequency can increase network and CPU load on services.
→ Scaling Prometheus for very large environments requires federation or sharding, adding complexity.

Use Prometheus when you have multiple microservices that can expose HTTP endpoints and need real-time monitoring with alerting, especially at scales from hundreds to thousands of services.

Avoid Prometheus if your services cannot expose HTTP endpoints or if you have extremely high cardinality metrics that exceed Prometheus's storage and query capabilities.
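To make "high cardinality" concrete, here is a small worked calculation. It assumes the usual rule of thumb that one metric can produce up to one time series per combination of label values; the label names and counts are hypothetical, chosen only for illustration.

```python
from math import prod
from typing import Dict


def series_upper_bound(distinct_values_per_label: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of per-label value counts."""
    return prod(distinct_values_per_label.values())


# Hypothetical labels on a single request counter.
labels = {"method": 5, "status": 10, "endpoint": 50}
print(series_upper_bound(labels))  # 2500

# Adding an unbounded label such as a user ID multiplies the total.
labels["user_id"] = 100_000
print(series_upper_bound(labels))  # 250000000
```

Bounded labels keep the series count manageable, while a single per-user or per-request label can push one metric into the hundreds of millions of series.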

Real World Examples

Netflix

Netflix uses Prometheus to monitor microservices performance and availability, enabling rapid detection of streaming issues.

Uber

Uber employs Prometheus to collect metrics from its ride-hailing microservices, supporting real-time alerting and capacity planning.

Spotify

Spotify integrates Prometheus to track service health and user request latencies across its music streaming platform.

Code Example

Before, the service did not expose any metrics. After applying Prometheus metrics collection, the service increments a counter on each request and exposes a /metrics HTTP endpoint that Prometheus can scrape.

### Before: No metrics exposed
from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello():
    return "Hello World"

if __name__ == '__main__':
    app.run()


### After: Exposing Prometheus metrics
from flask import Flask
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

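# Counter values only ever increase; Prometheus derives request rates from them.
REQUEST_COUNT = Counter('app_requests_total', 'Total HTTP Requests')

@app.route('/')
def hello():
    REQUEST_COUNT.inc()  # count one handled request
    return "Hello World"

@app.route('/metrics')
def metrics():
    # Serve all registered metrics in the Prometheus text exposition format.
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}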

if __name__ == '__main__':
    app.run()

Alternatives

Pushgateway

Pushgateway lets services that cannot be scraped push their metrics to an intermediary gateway, which Prometheus then scrapes like any other target.

Use when: services are short-lived (for example, batch jobs that exit before a scrape happens) or sit behind firewalls that prevent Prometheus from scraping them.
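A minimal sketch of what a push looks like on the wire, using only the standard library. It assumes the Pushgateway HTTP API shape in which metrics in the text format are sent to `/metrics/job/<job_name>`, with `PUT` replacing the whole group and `POST` replacing only same-named metrics. The helper name `build_push_request`, the gateway host, and the metric name are hypothetical.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(
    gateway_url: str, job: str, body: str, replace_group: bool = True
) -> urllib.request.Request:
    """Build (but do not send) a push request for one job's metrics."""
    url = f"{gateway_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    # PUT replaces every metric in the group; POST replaces only same-named ones.
    method = "PUT" if replace_group else "POST"
    if not body.endswith("\n"):
        body += "\n"  # the text format expects a trailing newline
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method=method,
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical batch job reporting how many rows it processed before exiting.
request = build_push_request(
    "http://pushgateway.example:9091", "nightly_batch", "batch_rows_processed 1200"
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

In practice the `prometheus_client` library used in the code example above also ships a push helper, so a hand-built request like this is mainly useful for understanding the flow.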

StatsD

StatsD uses a push model where services send metrics to a daemon that aggregates and forwards them.

Use when: you prefer a push model or have legacy systems that cannot expose HTTP endpoints.
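For contrast with the pull model, this sketch shows a StatsD-style push. It assumes the common StatsD line protocol (`<name>:<value>|c` for counters, sent over UDP, conventionally to port 8125); the function names and the metric name are hypothetical.

```python
import socket


def statsd_counter_line(name: str, value: int = 1) -> str:
    """Format a counter increment in the StatsD line protocol."""
    return f"{name}:{value}|c"


def send_statsd(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; the service never waits for the daemon."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


line = statsd_counter_line("app.requests")
print(line)  # app.requests:1|c
```

The service initiates every send here, which is the key difference from Prometheus, where the server decides when to collect.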

OpenTelemetry

OpenTelemetry provides a vendor-neutral framework for collecting metrics, traces, and logs, supporting multiple backends including Prometheus.

Use when: you want unified observability data collection across metrics, traces, and logs.

Summary

Prometheus collects metrics by scraping HTTP endpoints exposed by microservices.

It stores metrics as time-series data enabling real-time monitoring and alerting.

Prometheus is widely used in microservices environments for scalable and reliable observability.

Practice

(1/5)

1. What is the main purpose of Prometheus in a microservices environment?

easy

A. To collect and store metrics from services for monitoring

B. To deploy microservices automatically

C. To manage user authentication

D. To serve web pages to users

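Solution

Step 1: Understand Prometheus role

Prometheus scrapes metrics from the HTTP endpoints that services expose and stores them as time-series data.

Step 2: Identify monitoring purpose

Those metrics are used for monitoring and alerting; Prometheus does not deploy services, authenticate users, or serve web pages.

Final Answer:

A. To collect and store metrics from services for monitoring

Quick Check:

Prometheus = metrics collection and storage for monitoring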

Solution

Step 1: Check Prometheus YAML syntax

Step 2: Validate target format

Final Answer:

Quick Check:

Solution

Step 1: Understand `rate()` function

Step 2: Apply to `http_requests_total[5m]`

Final Answer:

Quick Check:
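As a rough illustration of what `rate()` computes over a window such as `http_requests_total[5m]`, the sketch below derives a per-second rate from raw counter samples. It is a simplification: it handles counter resets but skips the extrapolation to window boundaries that Prometheus performs, so real query results can differ slightly. The function name `simple_rate` is hypothetical.

```python
from typing import List, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> float:
    """Per-second increase of a counter across a window.

    samples: (timestamp_seconds, counter_value) pairs, sorted by time,
    all falling inside the window (for example, the last 5 minutes).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter was reset (service restart): count from zero.
        increase += value if value < previous else value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# The counter grew from 100 to 400 over 300 seconds: one request per second.
print(simple_rate([(0.0, 100.0), (300.0, 400.0)]))  # 1.0
```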

Solution

Step 1: Understand default metrics path

Step 2: Fix missing metrics path

Final Answer:

Quick Check:

Solution

Step 1: Filter error status codes 500-599

Step 2: Calculate error rate as percentage

Final Answer:

Quick Check:
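The two steps above (filter 5xx status codes, then express them as a percentage of all requests) can be sketched in plain Python. The input is assumed to be per-status request rates, such as those produced by a `rate()` query grouped by a status label; the label name, the function name `error_rate_percent`, and the sample numbers are hypothetical.

```python
from typing import Dict


def error_rate_percent(rate_by_status: Dict[str, float]) -> float:
    """Share of requests with a 500-599 status, as a percentage of all requests."""
    total = sum(rate_by_status.values())
    if total == 0:
        return 0.0  # no traffic, so no errors to report
    errors = sum(
        rate for status, rate in rate_by_status.items() if 500 <= int(status) <= 599
    )
    return 100.0 * errors / total


# Hypothetical requests-per-second broken down by HTTP status code.
rates = {"200": 95.0, "404": 3.0, "500": 1.5, "503": 0.5}
print(error_rate_percent(rates))  # 2.0
```

Dividing by total traffic rather than alerting on the raw error count keeps the signal comparable across busy and quiet periods.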
