
Metrics collection (Prometheus) in Microservices - Deep Dive

Overview - Metrics collection (Prometheus)
What is it?
Metrics collection with Prometheus is a way to gather and store numeric data about how software systems perform. It helps track things like how many requests a service gets, how long they take, and whether errors happen. Prometheus is a tool that collects this data at regular intervals, stores it as time series, and lets you query it later. This helps teams understand their systems and fix problems quickly.
Why it matters
Without metrics collection, teams would be blind to how their software behaves in real life. Problems like slow responses or crashes could go unnoticed until users complain. Prometheus solves this by giving clear, timely insights into system health. This means faster fixes, better reliability, and happier users.
Where it fits
Before learning Prometheus, you should understand basic microservices and how software systems communicate. After mastering Prometheus, you can explore alerting systems, dashboards like Grafana, and advanced monitoring strategies like distributed tracing.
Mental Model
Core Idea
Prometheus regularly pulls performance data from services to build a time-based picture of system health.
Think of it like...
Imagine a weather station that checks temperature, wind, and rain every few minutes to predict the weather. Prometheus is like that station, but for software systems, checking their 'health signs' regularly.
┌───────────────┐      scrape       ┌───────────────┐
│   Prometheus  │◀───────────────▶│   Microservice │
│   Server      │                 │   Exporter    │
└───────────────┘                 └───────────────┘
        │
        │ stores time-series data
        ▼
┌─────────────────────────────┐
│ Time-Series Database         │
│ (metrics over time)          │
└─────────────────────────────┘
Build-Up - 7 Steps
1
Foundation: What Are Metrics and Why Collect Them
Concept: Metrics are numbers that describe system behavior.
Metrics are measurements like how many users visit a website or how long a request takes. Collecting these helps us understand if a system is working well or if there are problems. Without metrics, we guess about system health instead of knowing.
Result
You understand that metrics are essential signals about system performance and reliability.
Knowing what metrics represent is the first step to monitoring systems effectively.
2
Foundation: Prometheus Basics and Data Model
Concept: Prometheus collects and stores metrics as time-series data with labels.
Prometheus collects metrics by asking services for their current data at regular intervals. It stores this data as time series: each series is identified by a metric name plus a set of labels, and holds a sequence of samples, each a numeric value with a timestamp. Labels add extra information, such as which server or region the data comes from.
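The data model described above can be sketched in a few lines of Python. This is a toy in-memory illustration, not how Prometheus stores data internally; the names SeriesKey, make_key, and TinyTSDB are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesKey:
    """Identity of one time series: metric name plus its label set."""
    name: str
    labels: tuple  # sorted tuple of (label_name, label_value) pairs


def make_key(name, **labels):
    # Sorting makes the key independent of the order labels are given in.
    return SeriesKey(name, tuple(sorted(labels.items())))


class TinyTSDB:
    """Toy store (hypothetical): maps each series to its list of samples."""

    def __init__(self):
        self.series = {}  # SeriesKey -> list of (timestamp, value)

    def append(self, key, timestamp, value):
        self.series.setdefault(key, []).append((timestamp, float(value)))


db = TinyTSDB()
web1 = make_key("http_requests_total", instance="web-1", region="eu")
web2 = make_key("http_requests_total", instance="web-2", region="eu")
db.append(web1, 1000, 10)  # sample at t=1000
db.append(web1, 1015, 14)  # same series, 15 seconds later
db.append(web2, 1000, 3)   # different label values: a separate series
print(len(db.series), "series stored")
```

Here the same metric name yields two distinct series because the instance label differs, which is the property that later makes filtering and grouping possible.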
Result
You see how Prometheus organizes data to track changes over time and across different parts of a system.
Understanding the time-series and labels model is key to querying and analyzing metrics later.
3
Intermediate: How Prometheus Scrapes Metrics from Services
Before reading on: do you think Prometheus waits for services to send data, or does it ask them regularly? Commit to your answer.
Concept: Prometheus uses a pull model, where it regularly requests metrics from services instead of waiting for them to send data.
Each service exposes an HTTP endpoint, conventionally /metrics, that shows its current metrics in a text format Prometheus understands. Services instrumented with a client library expose this endpoint themselves; for software that cannot be instrumented directly, such as a database, a separate process called an exporter translates its statistics into that format. Prometheus requests these endpoints on a schedule (the scrape interval) to collect fresh data. This pull approach lets Prometheus control when and how often it collects data.
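To make the pull model concrete, here is a minimal sketch using only the Python standard library. It is not the official Prometheus client library: the handler, the hard-coded REQUESTS_SERVED value, and the scrape helper are assumptions made for illustration. The service side answers GET /metrics, and the collector side initiates the request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_SERVED = 42  # hypothetical value of an in-process request counter


class MetricsHandler(BaseHTTPRequestHandler):
    """Service side: answers GET /metrics with the current values."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# TYPE http_requests_total counter\n"
            f"http_requests_total {REQUESTS_SERVED}\n"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this example


def scrape(url):
    """Collector side: one scrape is one HTTP GET started by the collector."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.read().decode("utf-8")


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    scraped = scrape(f"http://127.0.0.1:{server.server_address[1]}/metrics")
finally:
    server.shutdown()
    server.server_close()
print(scraped)
```

The service never needs to know where the collector lives; it only has to answer when asked. A real Prometheus server would repeat the scrape on every scrape interval and attach a timestamp to each value it reads.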
Result
You know that Prometheus actively scrapes metrics, which affects how services expose their data.
Knowing the pull model explains why services must expose metrics endpoints and how Prometheus controls data freshness.
4
Intermediate: Common Metric Types and Their Uses
Before reading on: which metric type do you think counts events, and which measures durations? Commit to your answer.
Concept: Prometheus supports counters, gauges, histograms, and summaries, each suited for different measurement needs.
Counters only go up (apart from resetting to zero when the process restarts) and count things like requests served. Gauges can go up or down, like current memory usage. Histograms and summaries both measure distributions, like request durations: a histogram counts observations into configurable buckets, showing how many requests fall into each time range, while a summary calculates quantiles inside the service itself.
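The differences between the types are easiest to see in code. The classes below are simplified stand-ins written for this lesson, not the API of any Prometheus client library; summaries are left out to keep the sketch short. The histogram uses cumulative buckets, meaning each bucket counts all observations less than or equal to its upper bound.

```python
import bisect


class Counter:
    """Monotonic count: it can only increase while the process runs."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Current value that may go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Cumulative buckets: bucket i counts observations <= upper_bounds[i]."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # First bucket whose upper bound is >= value; all later ones count too.
        first = bisect.bisect_left(self.upper_bounds, value)
        for i in range(first, len(self.upper_bounds)):
            self.bucket_counts[i] += 1
        self.sum += value
        self.count += 1


requests_served = Counter()
requests_served.inc()
requests_served.inc()

memory_mb = Gauge()
memory_mb.set(512)
memory_mb.set(256)  # gauges may drop

latency = Histogram([0.1, 0.5, 1.0])  # bucket bounds in seconds (assumed)
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)
print(latency.bucket_counts)  # → [1, 3, 3, 4]
```

The last bucket always equals the total number of observations, because its upper bound is infinity.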
Result
You can choose the right metric type to represent different system behaviors accurately.
Understanding metric types helps design meaningful metrics that reveal real system performance.
5
Intermediate: Labeling Metrics for Detailed Insights
Concept: Labels add context to metrics, allowing filtering and grouping by attributes like service name or region.
For example, a metric counting requests can have labels for HTTP method (GET, POST) and status code (200, 404). This lets you ask questions like 'How many POST requests failed?' Labels make metrics flexible and powerful for analysis.
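The "How many POST requests failed?" question can be sketched as follows. LabeledCounter is a hypothetical helper for this lesson, not a real client-library class, and the request data is made up; "failed" is assumed here to mean any 4xx or 5xx status.

```python
from collections import defaultdict


class LabeledCounter:
    """Toy counter that keeps one value per combination of label values."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self.values = defaultdict(float)  # tuple of label values -> count

    def inc(self, **labels):
        key = tuple(labels[n] for n in self.label_names)
        self.values[key] += 1

    def total(self, predicate):
        """Sum every series whose labels satisfy the predicate."""
        result = 0.0
        for key, value in self.values.items():
            labels = dict(zip(self.label_names, key))
            if predicate(labels):
                result += value
        return result


http_requests = LabeledCounter("http_requests_total", ["method", "status"])
observed = [("GET", "200"), ("POST", "200"), ("POST", "500"),
            ("POST", "404"), ("GET", "404")]
for method, status in observed:
    http_requests.inc(method=method, status=status)

failed_posts = http_requests.total(
    lambda labels: labels["method"] == "POST"
    and labels["status"].startswith(("4", "5"))
)
print(failed_posts)  # → 2.0
```

In PromQL the same question would look roughly like sum(http_requests_total{method="POST", status=~"4..|5.."}), with the label matchers doing the filtering that the predicate does here.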
Result
You see how labels turn simple numbers into rich, queryable data.
Knowing how to use labels effectively unlocks detailed monitoring and troubleshooting.
6
Advanced: Scaling Prometheus for Large Systems
Before reading on: do you think one Prometheus server can handle all metrics in a big system, or do you need multiple? Commit to your answer.
Concept: Large systems require multiple Prometheus servers or remote storage to handle volume and reliability.
Prometheus servers can be federated, where one server scrapes others to aggregate data. Remote write allows sending metrics to scalable storage systems. These patterns help manage high data volumes and ensure monitoring stays reliable.
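The federation idea can be illustrated with a small simulation. This is only a sketch of the data flow under assumed cluster names, job names, and values. In a real setup the global server scrapes each leaf server's /federate endpoint and selects series with match[] parameters, which this example does not reproduce.

```python
def aggregate_by_job(leaf_series):
    """Collapse per-instance series into one value per job."""
    totals = {}
    for (job, _instance), value in leaf_series.items():
        totals[job] = totals.get(job, 0.0) + value
    return totals


def federate(leaves):
    """Global view: pull only the aggregated series from every leaf server."""
    global_view = {}
    for cluster, series in leaves.items():
        for job, value in aggregate_by_job(series).items():
            global_view[(cluster, job)] = value
    return global_view


# Hypothetical leaf servers, each holding detailed per-instance data.
leaf_eu = {
    ("checkout", "eu-1"): 120.0,
    ("checkout", "eu-2"): 80.0,
    ("search", "eu-1"): 40.0,
}
leaf_us = {
    ("checkout", "us-1"): 200.0,
    ("search", "us-1"): 60.0,
}

global_view = federate({"eu": leaf_eu, "us": leaf_us})
for key in sorted(global_view):
    print(key, global_view[key])
```

The leaf servers keep five detailed series while the global server holds four aggregated ones. With hundreds of instances per job the reduction is much larger, which is why the global level usually pulls aggregates rather than raw data.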
Result
You understand how to design Prometheus setups that grow with system size.
Knowing scaling patterns prevents bottlenecks and data loss in production monitoring.
7
Expert: Handling High Cardinality and Performance Challenges
Before reading on: do you think adding many unique label values improves or harms Prometheus performance? Commit to your answer.
Concept: High cardinality (many unique label combinations) can cause performance issues and data explosion in Prometheus.
Each unique label set creates a new time-series, increasing memory and storage needs. Experts carefully design labels to avoid unnecessary uniqueness. Techniques like relabeling and metric aggregation reduce cardinality. Understanding this helps keep Prometheus efficient and stable.
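A quick calculation shows why this matters. The worst-case number of series for one metric is the product of the number of distinct values per label. The label names and value counts below are assumptions chosen for illustration.

```python
from math import prod


def worst_case_series(label_values):
    """Upper bound on series count: every combination of label values."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "404", "500"],
    "endpoint": ["/login", "/cart", "/search"],
}
print(worst_case_series(bounded))  # → 36

# Adding a per-user label multiplies the total by the number of users.
with_user_id = dict(bounded, user_id=[str(i) for i in range(100_000)])
print(worst_case_series(with_user_id))  # → 3600000
```

Not every combination occurs in practice, so the real count is usually lower, but the multiplication explains why a single unbounded label can dominate memory use.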
Result
You can prevent common performance pitfalls and design metrics that scale well.
Recognizing the impact of cardinality is crucial for maintaining Prometheus performance in complex systems.
Under the Hood
Prometheus works by periodically sending HTTP requests to configured endpoints called scrape targets. These targets, which are either instrumented services or exporters, expose metrics in a text format that Prometheus parses into time-series data. Each sample is stored with a timestamp under its metric name and labels in a local database optimized for time-series queries. Prometheus uses a query language called PromQL to retrieve and aggregate this data. Internally, it manages memory and disk storage to balance performance and retention.
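The parsing step can be sketched with a small reader for the text format. This is a simplified illustration, not Prometheus's actual parser: it skips comment lines, ignores optional timestamps, and does not unescape label values. The sample payload is made up.

```python
import re

# metric_name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Turn scraped text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no samples
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels,
                        float(match.group("value"))))
    return samples


payload = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3
process_resident_memory_bytes 2.5e+07
"""
parsed = parse_exposition(payload)
for sample in parsed:
    print(sample)
```

After parsing, the collector attaches the scrape time to each sample and appends it to the series identified by the metric name and labels.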
Why designed this way?
Prometheus was designed for reliability and simplicity. The pull model avoids the complexity of services pushing data and lets Prometheus control collection timing. Time-series storage fits the nature of monitoring data, which changes over time. Labels provide flexible metadata without rigid schemas. Alternatives like push-based systems were rejected to reduce coupling and improve fault tolerance.
┌───────────────┐       scrape        ┌───────────────┐
│   Prometheus  │◀──────────────────▶│   Exporter    │
│   Server      │                    │ (metrics HTTP)│
└───────────────┘                    └───────────────┘
        │
        │ stores
        ▼
┌─────────────────────────────┐
│ Time-Series Database         │
│ - Metrics with timestamps    │
│ - Labels for metadata        │
└─────────────────────────────┘
        │
        │ queries with PromQL
        ▼
┌─────────────────────────────┐
│ Query Engine & Alerting      │
└─────────────────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does Prometheus push metrics to a central server or pull them? Commit to your answer.
Common Belief: Services push their metrics to the Prometheus server automatically.
Reality: Prometheus uses a pull model, where it scrapes metrics by requesting them from services at intervals.
Why it matters: Assuming push leads to wrong setup and confusion about how to expose metrics, causing monitoring gaps.
Quick: Is adding more labels always better for monitoring detail? Commit to your answer.
Common Belief: More labels always improve monitoring by adding detail and context.
Reality: Labels with many distinct values create high cardinality, which increases Prometheus memory use, slows queries, and can crash the server when memory runs out.
Why it matters: Ignoring cardinality leads to system overload and unreliable monitoring data.
Quick: Can Prometheus store logs and traces as well as metrics? Commit to your answer.
Common Belief: Prometheus can store all types of monitoring data including logs and traces.
Reality: Prometheus is specialized for numeric time-series metrics; logs and traces require other tools like Loki or Jaeger.
Why it matters: Expecting Prometheus to handle all observability leads to incomplete monitoring and tool misuse.
Quick: Does Prometheus automatically alert you when something goes wrong? Commit to your answer.
Common Belief: Prometheus sends alerts automatically without extra configuration.
Reality: Prometheus only evaluates alerting rules that you define; sending notifications requires integration with Alertmanager, which routes firing alerts to channels such as email or chat.
Why it matters: Assuming automatic alerts causes missed incidents and delayed responses.
Expert Zone
1
Prometheus's pull model simplifies network security by requiring only the monitoring server to initiate connections, reducing firewall complexity.
2
Relabeling rules in Prometheus allow dynamic modification of labels during scraping, enabling flexible metric management without changing service code.
3
Prometheus's local storage uses a custom time-series database optimized for fast writes and queries, but integrating remote storage is essential for long-term retention.
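The relabeling idea from item 2 can be illustrated with a small imitation of a drop-labels rule. Prometheus has a relabel action named labeldrop that removes labels whose names match a regular expression; the Python function below only mimics that behavior, and the label names in the example are hypothetical.

```python
import re


def labeldrop(labels, pattern):
    """Return a copy of labels without names that fully match pattern."""
    regex = re.compile(pattern)
    return {name: value for name, value in labels.items()
            if not regex.fullmatch(name)}


scraped_labels = {
    "__name__": "http_requests_total",
    "method": "GET",
    "pod_template_hash": "5d4f8c7b9",  # hypothetical noisy label
    "debug_id": "x91",                 # hypothetical noisy label
}
cleaned = labeldrop(scraped_labels, r"pod_template_hash|debug_.*")
print(cleaned)
```

Because the rule runs on the collector side, the service code stays untouched. One caveat: after dropping labels, the remaining label sets still need to be unique per series.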
When NOT to use
Prometheus is not ideal for very high cardinality data like per-user metrics, or for event data such as detailed logs. In such cases, use log management tools or tracing backends instead, for example fed by OpenTelemetry instrumentation.
Production Patterns
In production, Prometheus is often paired with exporters for databases, message queues, and hardware metrics. Federation aggregates metrics across clusters. Alertmanager handles notifications. Grafana visualizes data. These patterns create a robust monitoring ecosystem.
Connections
Distributed Tracing
Complementary observability technique
Understanding metrics collection helps interpret tracing data by providing quantitative context to request flows.
Time-Series Databases
Prometheus uses a specialized time-series database internally
Knowing time-series database principles clarifies how Prometheus stores and queries metrics efficiently.
Supply Chain Management
Both track flow and status over time for complex systems
Seeing metrics collection like supply chain tracking reveals how monitoring ensures smooth operation by spotting bottlenecks early.
Common Pitfalls
#1 Exposing metrics without authentication on a public network
Wrong approach: http://myservice.com/metrics exposed openly without any access control
Correct approach: Use network policies or authentication proxies to restrict access to http://myservice.com/metrics
Root cause: Not realizing that metrics endpoints can leak sensitive information if they are not secured.
#2 Using high-cardinality labels like user IDs in metrics
Wrong approach: http_requests_total{user_id="12345"} 1
Correct approach: http_requests_total{endpoint="/login",status="200"} 1
Root cause: Not realizing that every distinct label value creates a new time series, so unbounded labels overload Prometheus.
#3 Expecting Prometheus to store logs alongside metrics
Wrong approach: Trying to push log data into Prometheus metrics format
Correct approach: Use dedicated log systems like Loki for logs and keep Prometheus for numeric metrics
Root cause: Confusing different observability data types and tool purposes.
Key Takeaways
Prometheus collects numeric metrics by regularly pulling data from services, building a time-series database of system health.
Labels add important context to metrics but must be used carefully to avoid performance issues from high cardinality.
Prometheus is designed for reliability and simplicity, using a pull model and a specialized storage engine for efficient monitoring.
Scaling Prometheus requires federation or remote storage to handle large systems and data volumes.
Effective monitoring with Prometheus involves securing metrics endpoints, choosing appropriate metric types, and integrating alerting and visualization tools.