Metrics collection (Prometheus) in Microservices - Deep Dive

Overview - Metrics collection (Prometheus)

What is it?

Metrics collection with Prometheus is a way to gather and store data about how software systems perform. It helps track things like how many requests a service gets, how long they take, and if errors happen. Prometheus is a tool that collects this data regularly and lets you ask questions about it later. This helps teams understand their systems and fix problems quickly.

Why it matters

Without metrics collection, teams would be blind to how their software behaves in real life. Problems like slow responses or crashes could go unnoticed until users complain. Prometheus solves this by giving clear, timely insights into system health. This means faster fixes, better reliability, and happier users.

Where it fits

Before learning Prometheus, you should understand basic microservices and how software systems communicate. After mastering Prometheus, you can explore alerting systems, dashboards like Grafana, and advanced monitoring strategies like distributed tracing.

Mental Model

Core Idea

Prometheus regularly pulls performance data from services to build a time-based picture of system health.

Think of it like...

Imagine a weather station that checks temperature, wind, and rain every few minutes to predict the weather. Prometheus is like that station, but for software systems, checking their 'health signs' regularly.

┌───────────────┐     scrape      ┌───────────────┐
│   Prometheus  │◀───────────────▶│  Microservice │
│   Server      │                 │   Exporter    │
└───────────────┘                 └───────────────┘
        │
        │ stores time-series data
        ▼
┌─────────────────────────────┐
│ Time-Series Database         │
│ (metrics over time)          │
└─────────────────────────────┘

Build-Up - 7 Steps

Foundation: What Are Metrics and Why Collect Them

Concept: Introduce the idea of metrics as numbers that describe system behavior.

Metrics are measurements like how many users visit a website or how long a request takes. Collecting these helps us understand if a system is working well or if there are problems. Without metrics, we guess about system health instead of knowing.

Result

You understand that metrics are essential signals about system performance and reliability.

Knowing what metrics represent is the first step to monitoring systems effectively.

Foundation: Prometheus Basics and Data Model

Intermediate: How Prometheus Scrapes Metrics from Services

Intermediate: Common Metric Types and Their Uses

Intermediate: Labeling Metrics for Detailed Insights

Advanced: Scaling Prometheus for Large Systems

Expert: Handling High Cardinality and Performance Challenges

Under the Hood

Prometheus periodically sends HTTP requests to configured endpoints called targets. A target is either a service that exposes its own metrics or an exporter, a helper process that translates another system's statistics into Prometheus's format. Each target returns its metrics in a plain-text format, which Prometheus parses into time series. Every sample is stored with a timestamp, and each series is identified by its metric name and optional labels, in a database optimized for time-series queries. PromQL, Prometheus's query language, retrieves and aggregates this data. Internally, Prometheus manages memory and disk storage to balance performance against retention.
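To make the scrape-and-store step concrete, here is a minimal sketch in Python using only the standard library. It parses a simplified subset of the text format (no sample timestamps, no unescaping of label values) and appends one point per series to an in-memory dictionary. The function names, the payload, and the storage layout are illustrative assumptions, not how Prometheus itself is implemented.

```python
import re
import time

# Matches lines such as: http_requests_total{endpoint="/login",status="200"} 1027
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse a simplified subset of the text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # this sketch silently ignores lines it cannot parse
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def store_scrape(storage, text, scrape_time):
    """Append one (timestamp, value) point per series, keyed by name plus labels."""
    for name, labels, value in parse_exposition(text):
        series_key = (name, tuple(sorted(labels.items())))
        storage.setdefault(series_key, []).append((scrape_time, value))


# Hypothetical payload, as a service might return it from its metrics endpoint.
EXAMPLE_PAYLOAD = '''\
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{endpoint="/login",status="200"} 1027
http_requests_total{endpoint="/login",status="500"} 3
process_open_fds 42
'''

storage = {}
store_scrape(storage, EXAMPLE_PAYLOAD, scrape_time=time.time())
for key, points in storage.items():
    print(key, points)
```

Calling `store_scrape` again with a later `scrape_time` adds a second point to each series, which is the "time-based picture" that queries later read from.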

Why designed this way?

Prometheus was designed for reliability and simplicity. The pull model avoids the complexity of services pushing data and lets Prometheus control collection timing. Time-series storage fits the nature of monitoring data, which changes over time. Labels provide flexible metadata without rigid schemas. Alternatives like push-based systems were rejected to reduce coupling and improve fault tolerance.

┌───────────────┐       scrape        ┌───────────────┐
│   Prometheus  │◀──────────────────▶│   Exporter    │
│   Server      │                    │ (metrics HTTP)│
└───────────────┘                    └───────────────┘
        │
        │ stores
        ▼
┌─────────────────────────────┐
│ Time-Series Database         │
│ - Metrics with timestamps    │
│ - Labels for metadata        │
└─────────────────────────────┘
        │
        │ queries with PromQL
        ▼
┌─────────────────────────────┐
│ Query Engine & Alerting      │
└─────────────────────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Does Prometheus push metrics to a central server or pull them? Commit to your answer.

Common Belief: Prometheus pushes metrics from services to a central server automatically.

Reality: Prometheus pulls. It scrapes metrics from each service's HTTP endpoint on a schedule it controls.

Quick: Is adding more labels always better for monitoring detail? Commit to your answer.

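Common Belief: More labels always improve monitoring by adding detail and context.

Reality: Each distinct combination of label values creates a separate time series, so high-cardinality labels such as user IDs can overload Prometheus.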

Quick: Can Prometheus store logs and traces as well as metrics? Commit to your answer.

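Common Belief: Prometheus can store all types of monitoring data including logs and traces.

Reality: Prometheus stores numeric time-series metrics only. Logs and traces belong in dedicated systems, such as Loki for logs.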

Quick: Does Prometheus automatically alert you when something goes wrong? Commit to your answer.

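Common Belief: Prometheus sends alerts automatically without extra configuration.

Reality: Alerts fire only for alerting rules you define, and notifications are handled by Alertmanager, which must be configured separately.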

Expert Zone

Prometheus's pull model simplifies network security by requiring only the monitoring server to initiate connections, reducing firewall complexity.

Relabeling rules in Prometheus allow dynamic modification of labels during scraping, enabling flexible metric management without changing service code.

Prometheus's local storage uses a custom time-series database optimized for fast writes and queries, but integrating remote storage is essential for long-term retention.
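The relabeling note above can be illustrated with a rough Python sketch of a single "replace" rule. It mirrors the general shape of a relabel rule (source labels joined by a separator, a regex that must match the whole joined value, a target label, and a replacement), but it is an approximation: Prometheus uses RE2 syntax and `$1`-style references, whereas this sketch uses Python's `re` module and `\1`. The function name and the label values are hypothetical.

```python
import re


def apply_replace_rule(labels, source_labels, regex, target_label,
                       replacement, separator=";"):
    """Sketch of a 'replace' relabel action on a dict of labels; returns a new dict."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # the pattern must match the whole joined value
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    updated = dict(labels)
    updated[target_label] = match.expand(replacement)
    return updated


# Hypothetical target labels; the names are illustrative only.
target = {"__address__": "10.0.0.7:8080", "pod": "checkout-6f9c"}
relabeled = apply_replace_rule(
    target,
    source_labels=["pod"],
    regex=r"([a-z]+)-.*",
    target_label="service",
    replacement=r"\1",
)
print(relabeled["service"])  # checkout
```

The service code never changes here: the `service` label is derived at scrape time from a label the target already carries.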

When NOT to use

Prometheus is not ideal for very high-cardinality data such as per-user metrics, or for detailed logs. For those cases, use tools built for event-level data: a tracing pipeline (for example, one instrumented with OpenTelemetry and backed by suitable storage) or a log management system.

Production Patterns

In production, Prometheus is often paired with exporters for databases, message queues, and hardware metrics. Federation aggregates metrics across clusters. Alertmanager handles notifications. Grafana visualizes data. These patterns create a robust monitoring ecosystem.

Connections

Distributed Tracing

Complementary observability technique

Understanding metrics collection helps interpret tracing data by providing quantitative context to request flows.

Time-Series Databases

Prometheus uses a specialized time-series database internally

Knowing time-series database principles clarifies how Prometheus stores and queries metrics efficiently.

Supply Chain Management

Both track flow and status over time for complex systems

Seeing metrics collection like supply chain tracking reveals how monitoring ensures smooth operation by spotting bottlenecks early.

Common Pitfalls

#1 Exposing metrics without authentication on a public network

Wrong approach: http://myservice.com/metrics exposed openly without any access control

Correct approach: Use network policies or authentication proxies to restrict access to http://myservice.com/metrics

Root cause: Not realizing that unsecured metrics endpoints can leak sensitive information.
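As one possible shape for the check an authentication proxy might perform in front of a metrics endpoint, the sketch below validates a bearer token. It assumes the scraper is configured to send an `Authorization` header and that the proxy hands the request headers over as a plain dictionary with that exact key; the function name and the token are hypothetical.

```python
import hmac


def is_scrape_authorized(headers, expected_token):
    """Return True only if the request carries the expected bearer token."""
    auth = headers.get("Authorization", "")
    scheme, _, presented = auth.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # compare_digest avoids leaking, via timing, how many leading characters matched
    return hmac.compare_digest(presented.encode(), expected_token.encode())


print(is_scrape_authorized({"Authorization": "Bearer s3cret"}, "s3cret"))  # True
print(is_scrape_authorized({}, "s3cret"))  # False
```

A token check like this complements network policies rather than replacing them; keeping the endpoint off the public network remains the simpler defense.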

#2 Using high-cardinality labels like user IDs in metrics

Wrong approach: http_requests_total{user_id="12345"} 1

Correct approach: http_requests_total{endpoint="/login",status="200"} 1

Root cause: Not realizing that unique labels multiply time-series and overload Prometheus.
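The multiplication behind this pitfall is easy to work through. The sketch below computes an upper bound on the number of time series for one metric as the product of the distinct values each label can take. The cardinalities are made-up numbers for illustration, and the real count is lower when some label combinations never occur.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities, chosen only for illustration.
bounded = {"endpoint": 20, "status": 5}
with_user_id = {"endpoint": 20, "status": 5, "user_id": 100_000}

print(series_count(bounded))       # 100
print(series_count(with_user_id))  # 10000000
```

Adding a single unbounded label turns a hundred series into millions, which is why identifiers such as user IDs are better kept in logs or traces.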

#3 Expecting Prometheus to store logs alongside metrics

Wrong approach: Trying to push log data into Prometheus metrics format

Correct approach: Use dedicated log systems like Loki for logs and keep Prometheus for numeric metrics

Root cause: Confusing different observability data types and tool purposes.

Key Takeaways

Prometheus collects numeric metrics by regularly pulling data from services, building a time-series database of system health.

Labels add important context to metrics but must be used carefully to avoid performance issues from high cardinality.

Prometheus is designed for reliability and simplicity, using a pull model and a specialized storage engine for efficient monitoring.

Scaling Prometheus requires federation or remote storage to handle large systems and data volumes.

Effective monitoring with Prometheus involves securing metrics endpoints, choosing appropriate metric types, and integrating alerting and visualization tools.

Practice

(1/5)

1. What is the main purpose of Prometheus in a microservices environment?

easy

A. To collect and store metrics from services for monitoring

B. To deploy microservices automatically

C. To manage user authentication

D. To serve web pages to users

Solution

Step 1: Understand Prometheus role

Step 2: Identify monitoring purpose

Final Answer:

Quick Check:

Solution

Step 1: Check Prometheus YAML syntax

Step 2: Validate target format

Final Answer:

Quick Check:

Solution

Step 1: Understand `rate()` function

Step 2: Apply to `http_requests_total[5m]`
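As a rough illustration of what `rate()` computes over a window such as `[5m]`, the sketch below takes counter samples as `(timestamp_seconds, value)` pairs and returns the average per-second increase, treating any drop in the counter as a reset to zero. It is deliberately simplified: the real function also extrapolates toward the window boundaries, which this sketch omits. The function name and sample values are hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter from (timestamp, value) samples.

    A decrease between two samples is treated as a counter reset to zero.
    """
    if len(samples) < 2:
        return None  # at least two points are needed
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            increase += curr  # counter restarted: count up from zero
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical 5-minute window of http_requests_total, sampled every 60 seconds.
window = [(0, 100.0), (60, 160.0), (120, 220.0),
          (180, 280.0), (240, 340.0), (300, 400.0)]
print(simple_rate(window))  # 1.0
```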

Final Answer:

Quick Check:

Solution

Step 1: Understand default metrics path

Step 2: Fix missing metrics path

Final Answer:

Quick Check:

Solution

Step 1: Filter error status codes 500-599

Step 2: Calculate error rate as percentage
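The arithmetic for this step can be sketched in Python: sum the requests whose status falls in 500-599, divide by all requests, and multiply by 100. In PromQL, a commonly used shape for the same calculation, assuming the metric carries a `status` label, is `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100`. The function name and the counts below are illustrative.

```python
def error_rate_percent(requests_by_status):
    """Percentage of requests whose HTTP status code is in the 500-599 range."""
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0  # no traffic: report zero rather than dividing by zero
    errors = sum(
        count
        for status, count in requests_by_status.items()
        if 500 <= int(status) <= 599
    )
    return 100.0 * errors / total


# Hypothetical request counts per status code over one window.
counts = {"200": 950, "404": 30, "500": 15, "503": 5}
print(error_rate_percent(counts))  # 2.0
```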

Final Answer:

Quick Check:
