Metrics collection (Prometheus) in Microservices - Scalability & System Analysis

Scalability Analysis - Metrics collection (Prometheus)

Growth Table: Metrics Collection with Prometheus

Scale	Number of Services	Metrics Volume	Prometheus Instances	Storage	Network Traffic
100 users	10-20 microservices	~10k metrics/min	1 Prometheus server	Few GBs (local disk)	Low (few MB/s)
10K users	50-100 microservices	~1M metrics/min	1-2 Prometheus servers with federation	100s GBs (local or network storage)	Moderate (10s MB/s)
1M users	200-500 microservices	~100M metrics/min	Multiple Prometheus servers with sharding and federation	TBs (networked storage or remote storage)	High (100s MB/s)
100M users	1000+ microservices	Billions of metrics/min	Highly distributed Prometheus with remote write to scalable TSDB (e.g., Cortex, Thanos)	Multiple TBs to PBs (cloud storage)	Very high (Gbps range)

First Bottleneck

The first bottleneck is the Prometheus server's ability to scrape and store metrics.

At small scale, a single Prometheus instance can handle scraping and storing metrics.

As the number of microservices and metrics grows, the server's CPU, memory, and disk I/O become saturated.

Network bandwidth can also become a bottleneck when scraping many endpoints frequently.

Scaling Solutions

Horizontal scaling: Run multiple Prometheus servers, each scraping a subset of services (sharding).
Federation: Use Prometheus federation to aggregate metrics from multiple servers.
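Remote storage: Offload long-term storage to horizontally scalable systems such as Cortex (fed by remote write) or Thanos (backed by object storage).
Caching and scrape interval tuning: Lengthen the scrape interval for less critical targets (for example, from 15 to 60 seconds), or cache expensive-to-compute metrics in the service, to reduce load. Sample volume falls in proportion to the interval.
Network optimization: Use service discovery to keep target lists current, so servers do not waste scrapes on stale or irrelevant endpoints.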
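
The sharding idea above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the service addresses are hypothetical, and the hashing scheme is an assumption chosen for simplicity. Prometheus offers a comparable mechanism through the `hashmod` relabel action, but this code is not guaranteed to produce the same shard numbers as Prometheus does.

```python
import hashlib
from collections import defaultdict


def shard_for(target: str, num_shards: int) -> int:
    """Map a scrape target address to a shard index in [0, num_shards)."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Interpret the last 8 bytes of the digest as an unsigned integer,
    # then take the modulus so the result is stable across runs.
    return int.from_bytes(digest[8:], "big") % num_shards


def assign_targets(targets: list[str], num_shards: int) -> dict[int, list[str]]:
    """Group targets by shard so each Prometheus server scrapes only its own group."""
    shards: dict[int, list[str]] = defaultdict(list)
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return dict(shards)


# Hypothetical service addresses, for illustration only.
example_targets = [f"service-{i}.internal:8080" for i in range(12)]
example_assignment = assign_targets(example_targets, 3)
for shard_index, members in sorted(example_assignment.items()):
    print(shard_index, members)
```

Because the shard is derived from a hash of the target address, every server can compute the assignment independently without coordination. The trade-off is that changing the number of shards moves most targets to a different server.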

Back-of-Envelope Cost Analysis

Assuming 500 microservices, each exposing 100 metrics, scraped every 15 seconds:

Metrics per scrape: 500 * 100 = 50,000
Scrapes per minute: 60 / 15 = 4
Total metrics per minute: 50,000 * 4 = 200,000
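Ingestion rate: 200,000 metrics/min ≈ 3,333 metrics per second. Assuming a single Prometheus server can handle ~10,000-50,000 metrics per second (a conservative figure; well-provisioned servers often ingest more), this load fits on one server.
Storage needed: 200,000 metrics/min * 60 min * 24 hr * 30 days ≈ 8.64 billion data points/month. At the 1-2 bytes per sample that Prometheus typically needs after compression, that is roughly 9-17 GB per month.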
Network bandwidth: Each metric ~100 bytes, so 200,000 * 100 bytes = ~20 MB/min (~333 KB/s).
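
The arithmetic above can be reproduced with a small calculator, which also makes it easy to try other inputs. This is a rough sketch: the function name and dictionary keys are made up for illustration, and the 100 bytes per metric figure is the same assumption used in the estimate above.

```python
def estimate_load(services: int, metrics_per_service: int,
                  scrape_interval_s: int, bytes_per_metric: int = 100) -> dict[str, float]:
    """Back-of-envelope ingestion, storage, and bandwidth figures."""
    metrics_per_scrape = services * metrics_per_service
    scrapes_per_minute = 60 / scrape_interval_s
    metrics_per_minute = metrics_per_scrape * scrapes_per_minute
    return {
        "metrics_per_scrape": metrics_per_scrape,
        "metrics_per_minute": metrics_per_minute,
        "metrics_per_second": metrics_per_minute / 60,
        # 30-day month: minutes/hour * hours/day * days
        "data_points_per_month": metrics_per_minute * 60 * 24 * 30,
        "network_bytes_per_second": metrics_per_minute * bytes_per_metric / 60,
    }


result = estimate_load(services=500, metrics_per_service=100, scrape_interval_s=15)
for key, value in result.items():
    print(f"{key}: {value:,.0f}")
```

For the inputs above this gives 200,000 metrics per minute, about 3,333 per second, roughly 8.64 billion data points over 30 days, and about 333 KB/s of scrape traffic.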

Interview Tip

Start by explaining the data flow: how Prometheus scrapes metrics from microservices.

Discuss the limits of a single Prometheus server and identify bottlenecks.

Then propose scaling strategies like sharding, federation, and remote storage.

Highlight trade-offs such as complexity vs. scalability.
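
When explaining the data flow, it can help to show the pull model concretely: each service exposes its current metric values as text on an HTTP endpoint (by default `/metrics`), and Prometheus fetches that text on every scrape and stores each value with a timestamp. The sketch below imitates both sides in one process. The function names are hypothetical, the HTTP request is replaced by a direct function call, and labels and most of the real exposition format are left out.

```python
def render_exposition(counters: dict[str, float]) -> str:
    """Service side: render counters in a simplified text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def parse_scrape(body: str, scrape_time: float) -> list[tuple[str, float, float]]:
    """Scraper side: turn a scraped body into (name, value, timestamp) samples."""
    samples = []
    for line in body.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and metadata comments
        name, value = line.rsplit(" ", 1)
        samples.append((name, float(value), scrape_time))
    return samples


# One simulated scrape of a service that has served 42 requests so far.
body = render_exposition({"http_requests_total": 42})
samples = parse_scrape(body, scrape_time=1700000000.0)
print(body)
print(samples)
```

The service only reports current values; the timestamp and the history come from the scraper. That is why the scraping server, not the service, is where CPU, memory, and disk pressure build up as targets are added.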

Self Check

Your Prometheus server handles 1000 queries per second (QPS). Traffic grows 10x. What do you do first?

Answer: Introduce horizontal scaling by splitting scrape targets across multiple Prometheus instances (sharding) and use federation to aggregate metrics.

Key Result

Prometheus scales well initially but hits CPU, memory, and storage limits as metrics grow; horizontal sharding and remote storage are key to scaling beyond millions of metrics.

Practice

(1/5)

1. What is the main purpose of Prometheus in a microservices environment?

easy

A. To collect and store metrics from services for monitoring

B. To deploy microservices automatically

C. To manage user authentication

D. To serve web pages to users

Solution

Step 1: Understand Prometheus role

Step 2: Identify monitoring purpose

Final Answer:

Quick Check:

Solution

Step 1: Check Prometheus YAML syntax

Step 2: Validate target format

Final Answer:

Quick Check:

Solution

Step 1: Understand `rate()` function

Step 2: Apply to `http_requests_total[5m]`

Final Answer:

Quick Check:

Solution

Step 1: Understand default metrics path

Step 2: Fix missing metrics path

Final Answer:

Quick Check:

Solution

Step 1: Filter error status codes 500-599

Step 2: Calculate error rate as percentage

Final Answer:

Quick Check:
