
Metrics collection (Prometheus) in Microservices - System Design Guide

Problem Statement
Without a centralized and efficient way to collect metrics, monitoring microservices becomes unreliable and slow. This leads to delayed detection of failures, poor understanding of system health, and difficulty in troubleshooting performance issues.
Solution
Prometheus solves this by scraping metrics from each microservice at regular intervals using a pull model. It stores these metrics as time-series data, allowing real-time querying and alerting. Each service exposes an HTTP endpoint with metrics in a standard format, enabling Prometheus to collect and aggregate data efficiently.
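The pull model described above can be sketched in a few lines. This is a minimal illustration, not how Prometheus itself is implemented: `parse_exposition`, `scrape_once`, `fake_fetch`, and the target name `orders:8000` are hypothetical, the parser handles only a simplified subset of the text format (no timestamps), and an in-memory dictionary stands in for the time-series database.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, float]  # (unix timestamp, value)


def parse_exposition(text: str) -> Dict[str, float]:
    """Parse a simplified subset of the text exposition format.

    Handles `name value` and `name{labels} value` lines. Comment lines
    (# HELP, # TYPE) and blank lines are skipped; timestamps are not handled.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape_once(
    targets: List[str],
    fetch: Callable[[str], str],
    store: Dict[Tuple[str, str], List[Sample]],
    now: float,
) -> None:
    """Pull metrics from every target and append them to the in-memory store."""
    for target in targets:
        for series, value in parse_exposition(fetch(target)).items():
            store[(target, series)].append((now, value))


def fake_fetch(target: str) -> str:
    """Canned response standing in for an HTTP GET of the target's /metrics."""
    return (
        "# HELP app_requests_total Total HTTP Requests\n"
        "# TYPE app_requests_total counter\n"
        "app_requests_total 42.0\n"
    )


store: Dict[Tuple[str, str], List[Sample]] = defaultdict(list)
scrape_once(["orders:8000"], fake_fetch, store, now=time.time())
print(dict(store))
```

Calling `scrape_once` on a fixed interval yields one growing list of timestamped samples per series, which is the shape of data that querying and alerting operate on.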
Architecture
[Diagram: microservices exposing /metrics -> Prometheus Server -> Alertmanager]

This diagram shows Prometheus scraping metrics from microservices exposing HTTP endpoints and forwarding alerts to Alertmanager.

Trade-offs
✓ Pros
Pull-based scraping allows Prometheus to control when and how often metrics are collected.
Time-series storage enables efficient querying and historical analysis of metrics.
Standardized metrics format simplifies integration with diverse microservices.
Built-in alerting rules, with notifications routed through Alertmanager, support proactive incident response.
✗ Cons
Pull model requires services to expose HTTP endpoints, which may not be feasible in all environments.
High scrape frequency can increase network and CPU load on services.
Scaling Prometheus for very large environments requires federation or sharding, adding complexity.
Use Prometheus when you have multiple microservices that can expose HTTP endpoints and need real-time monitoring with alerting, especially at scales from hundreds to thousands of services.
Avoid Prometheus if your services cannot expose HTTP endpoints or if you have extremely high-cardinality metrics (for example, labels that carry user or request IDs) that exceed Prometheus's storage and query capabilities.
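The cardinality and scrape-frequency concerns above come down to simple arithmetic. The sketch below is a rough estimate under stated assumptions: the label names and their distinct-value counts are hypothetical, and the product of label cardinalities is an upper bound, since not every combination necessarily occurs.

```python
from math import prod
from typing import Dict


def series_count(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def samples_per_second(total_series: int, scrape_interval_s: float) -> float:
    """Ingestion rate if every series is scraped once per interval."""
    return total_series / scrape_interval_s


# Hypothetical label sets for a single request counter.
bounded = {"method": 5, "status": 10, "endpoint": 50}
unbounded = {**bounded, "user_id": 100_000}

print(series_count(bounded))    # 2500
print(series_count(unbounded))  # 250000000
print(samples_per_second(series_count(bounded), 15.0))
```

Adding one unbounded label multiplies the series count by its number of distinct values, which is why identifiers such as user IDs are usually kept out of metric labels.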
Real World Examples
Netflix
Netflix uses Prometheus to monitor microservices performance and availability, enabling rapid detection of streaming issues.
Uber
Uber employs Prometheus to collect metrics from its ride-hailing microservices, supporting real-time alerting and capacity planning.
Spotify
Spotify integrates Prometheus to track service health and user request latencies across its music streaming platform.
Code Example
Before, the service did not expose any metrics. After applying Prometheus metrics collection, the service increments a counter on each request and exposes a /metrics HTTP endpoint that Prometheus can scrape.
### Before: No metrics exposed
from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello():
    return "Hello World"

if __name__ == '__main__':
    app.run()


### After: Exposing Prometheus metrics
from flask import Flask
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

REQUEST_COUNT = Counter('app_requests_total', 'Total HTTP Requests')

@app.route('/')
def hello():
    REQUEST_COUNT.inc()
    return "Hello World"

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

if __name__ == '__main__':
    app.run()
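
The counter above only ever increases, so its raw value is rarely what gets graphed or alerted on; the per-second rate over a time window is more useful. The sketch below shows that idea in plain Python. It is a simplification and not the exact algorithm behind PromQL's `rate()`, which also extrapolates to the window boundaries; `simple_rate` and the sample values are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter across the given samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value is counted as new increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed


# Hypothetical scrapes of app_requests_total taken 15 seconds apart,
# with a restart between the third and fourth scrape.
window = [(0.0, 100.0), (15.0, 130.0), (30.0, 160.0), (45.0, 10.0), (60.0, 40.0)]
print(simple_rate(window))
```

Handling resets this way is what lets a counter keep producing sensible rates even though the service process, and therefore the in-memory counter, restarts from zero.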
Alternatives
Pushgateway
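Pushgateway is an intermediary: jobs that cannot be scraped push their metrics to it, and Prometheus then scrapes the Pushgateway like any other target.
Use when: Use Pushgateway for short-lived, service-level batch jobs that may exit before Prometheus can scrape them. The Prometheus documentation advises against treating it as a general workaround for firewalls or NAT.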
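A push to the Pushgateway is an HTTP request carrying metrics in the text format to a `/metrics/job/<job>` path. The sketch below builds such a request with the standard library without sending it. The gateway address, job name, and metric name are hypothetical, and `build_push_request` is an illustrative helper, not part of any Prometheus client library.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway: str, job: str, body: str) -> urllib.request.Request:
    """Build (but do not send) a PUT that replaces all metrics in the job's group."""
    url = f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical gateway address, job name, and metric.
req = build_push_request(
    "http://pushgateway.example:9091",
    "nightly_backup",
    "backup_last_success_timestamp_seconds 1700000000\n",
)
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```

In a real service the official client library's push helpers would normally be used instead of hand-built requests; the sketch only makes the wire-level shape visible.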
StatsD
StatsD uses a push model where services send metrics to a daemon that aggregates and forwards them.
Use when: Choose StatsD when you prefer a push model or have legacy systems that cannot expose HTTP endpoints.
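For contrast with the pull model, a StatsD counter increment is a short text line, typically sent over UDP to the daemon. The sketch below formats such a line; `statsd_counter` and `send_udp` are hypothetical helpers, and the host and port are assumptions (8125 is the conventional StatsD port).

```python
import socket
from typing import Optional


def statsd_counter(name: str, value: int = 1, sample_rate: Optional[float] = None) -> bytes:
    """Format a counter increment in the StatsD line protocol: name:value|c[|@rate]."""
    line = f"{name}:{value}|c"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line.encode("ascii")


def send_udp(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send to an assumed local StatsD daemon."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


print(statsd_counter("app.requests"))
print(statsd_counter("app.requests", 1, sample_rate=0.1))
```

Here the service decides when to emit data, whereas with Prometheus the server decides when to collect it.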
OpenTelemetry
OpenTelemetry provides a vendor-neutral framework for collecting metrics, traces, and logs, supporting multiple backends including Prometheus.
Use when: Use OpenTelemetry when you want unified observability data collection across metrics, traces, and logs.
Summary
Prometheus collects metrics by scraping HTTP endpoints exposed by microservices.
It stores metrics as time-series data enabling real-time monitoring and alerting.
Prometheus is widely used in microservices environments for scalable and reliable observability.