
Metrics collection (Prometheus) in Microservices - Scalability & System Analysis

Scalability Analysis - Metrics collection (Prometheus)
Growth Table: Metrics Collection with Prometheus
Scale | Number of Services | Metrics Volume | Prometheus Instances | Storage | Network Traffic
100 users | 10-20 microservices | ~10k metrics/min | 1 single Prometheus server | Few GBs (local disk) | Low (few MB/s)
10K users | 50-100 microservices | ~1M metrics/min | 1-2 Prometheus servers with federation | 100s GBs (local or network storage) | Moderate (10s MB/s)
1M users | 200-500 microservices | ~100M metrics/min | Multiple Prometheus servers with sharding and federation | TBs (networked storage or remote storage) | High (100s MB/s)
100M users | 1000+ microservices | Billions of metrics/min | Highly distributed Prometheus with remote write to scalable TSDB (e.g., Cortex, Thanos) | Multiple TBs to PBs (cloud storage) | Very high (Gbps range)
First Bottleneck

The first bottleneck is the Prometheus server's ability to scrape and store metrics.

At small scale, a single Prometheus instance can handle scraping and storing metrics.

As the number of microservices and metrics grows, the server CPU, memory, and disk I/O become overwhelmed.

Network bandwidth can also become a bottleneck when scraping many endpoints frequently.

Scaling Solutions
  • Horizontal scaling: Run multiple Prometheus servers, each scraping a subset of services (sharding).
  • Federation: Use Prometheus federation to aggregate metrics from multiple servers.
  • Remote storage: Offload long-term storage to scalable time-series databases like Cortex or Thanos.
  • Caching and scraping interval tuning: Reduce scrape frequency or cache metrics to reduce load.
  • Network optimization: Use service discovery and scrape targets efficiently to reduce network overhead.
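
The sharding idea in the list above can be sketched in a few lines: hash each scrape target and take the result modulo the number of Prometheus servers, so every target lands on exactly one shard. This is only an illustration under assumptions. The names `shard_for` and `assign_targets`, the target naming scheme, and the choice of MD5 are hypothetical, and the code is not how Prometheus itself partitions targets (its relabeling offers a `hashmod` action for that purpose).

```python
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Map a scrape target to a shard index using a stable hash."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Use the last 8 bytes as an integer, then reduce modulo the shard count.
    return int.from_bytes(digest[-8:], "big") % num_shards


def assign_targets(targets, num_shards):
    """Group targets by the shard (Prometheus server) responsible for them."""
    shards = {i: [] for i in range(num_shards)}
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return shards


# Hypothetical fleet: 500 services split across 4 Prometheus servers.
targets = [f"service-{i}:8080" for i in range(500)]
shards = assign_targets(targets, 4)
for index, members in shards.items():
    print(f"shard {index}: {len(members)} targets")
```

Because the hash is stable, a target keeps its shard across restarts. Changing the number of shards reshuffles most targets, which is one of the complexity trade-offs of this approach.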
Back-of-Envelope Cost Analysis

Assuming 500 microservices, each exposing 100 metrics, scraped every 15 seconds:

  • Metrics per scrape: 500 * 100 = 50,000
  • Scrapes per minute: 60 / 15 = 4
  • Total metrics per minute: 50,000 * 4 = 200,000
  • Ingest rate: 200,000 / 60 ≈ 3,333 metrics per second, within the ~10,000-50,000 metrics per second assumed here for a single Prometheus server.
  • Storage needed: 200,000 metrics/min * 60 min * 24 hr * 30 days ≈ 8.64 billion data points/month.
  • Network bandwidth: Each metric ~100 bytes, so 200,000 * 100 bytes = ~20 MB/min (~333 KB/s).
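
The same estimate can be reproduced with a small script, which makes it easy to vary the inputs. This is a sketch of the arithmetic only. The function name `estimate` and the dictionary keys are made up for illustration, and the 100 bytes per metric figure is the assumption stated above.

```python
def estimate(services, metrics_per_service, interval_s, bytes_per_metric):
    """Back-of-envelope load estimate for one scrape configuration."""
    per_scrape = services * metrics_per_service      # metrics collected per scrape round
    scrapes_per_min = 60 / interval_s                # scrape rounds per minute
    per_min = per_scrape * scrapes_per_min
    return {
        "per_scrape": per_scrape,
        "per_min": per_min,
        "per_sec": per_min / 60,
        "per_month": per_min * 60 * 24 * 30,         # 30-day month
        "kb_per_sec": per_min * bytes_per_metric / 60 / 1000,
    }


result = estimate(services=500, metrics_per_service=100,
                  interval_s=15, bytes_per_metric=100)
for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```

With these inputs the monthly total comes to 8.64 billion data points, and the per-second ingest rate is roughly 3,333.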
Interview Tip

Start by explaining the data flow: how Prometheus scrapes metrics from microservices.

Discuss the limits of a single Prometheus server and identify bottlenecks.

Then propose scaling strategies like sharding, federation, and remote storage.

Highlight trade-offs such as complexity vs. scalability.

Self Check

Your Prometheus server handles 1000 queries per second (QPS). Traffic grows 10x. What do you do first?

Answer: Introduce horizontal scaling by splitting scrape targets across multiple Prometheus instances (sharding) and use federation to aggregate metrics.

Key Result
Prometheus scales well initially but hits CPU, memory, and storage limits as metrics grow; horizontal sharding and remote storage are key to scaling beyond millions of metrics.

Practice

1. What is the main purpose of Prometheus in a microservices environment?
easy
A. To collect and store metrics from services for monitoring
B. To deploy microservices automatically
C. To manage user authentication
D. To serve web pages to users

Solution

  1. Step 1: Understand Prometheus role

    Prometheus is designed to collect numerical data called metrics from running services.
  2. Step 2: Identify monitoring purpose

    These metrics help monitor service health and performance in microservices.
  3. Final Answer:

    To collect and store metrics from services for monitoring -> Option A
  4. Quick Check:

    Prometheus = Metrics collection [OK]
Hint: Prometheus is for metrics, not deployment or auth [OK]
Common Mistakes:
  • Confusing Prometheus with deployment tools
  • Thinking Prometheus manages users
  • Assuming Prometheus serves web content
2. Which YAML configuration snippet correctly defines a Prometheus scrape job for a service at http://localhost:8080/metrics?
easy
A. jobs: - job: 'myservice' endpoints: ['localhost:8080']
B. scrape_configs: - job_name: 'myservice' static_configs: - targets: ['http://localhost:8080/metrics']
C. scrape_configs: - job_name: 'myservice' static_configs: - targets: ['localhost:8080']
D. scrape_jobs: - name: 'myservice' targets: ['localhost:8080/metrics']

Solution

  1. Step 1: Check Prometheus YAML syntax

    Prometheus uses scrape_configs with job_name and static_configs listing targets as host:port without URL path.
  2. Step 2: Validate target format

    Targets must be host:port only, no http:// or path like /metrics.
  3. Final Answer:

    scrape_configs: - job_name: 'myservice' static_configs: - targets: ['localhost:8080'] -> Option C
  4. Quick Check:

    Targets = host:port only [OK]
Hint: Targets list host:port only, no URL scheme or path [OK]
Common Mistakes:
  • Including http:// or /metrics in targets
  • Using wrong YAML keys like scrape_jobs or jobs
  • Misnaming job_name or static_configs
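
To see why targets hold only host:port, it helps to look at how the scrape URL is put together: the scheme (default http), the target, and the metrics path (default /metrics) are separate settings that are joined at scrape time. The helper below is a hypothetical sketch of that composition, not Prometheus code, and its validation is deliberately simplistic.

```python
def scrape_url(target: str, scheme: str = "http",
               metrics_path: str = "/metrics") -> str:
    """Compose a scrape URL from separately configured parts."""
    # A target with a scheme or a path is the mistake described above.
    if "://" in target or "/" in target:
        raise ValueError(f"target must be host:port only, got {target!r}")
    return f"{scheme}://{target}{metrics_path}"


print(scrape_url("localhost:8080"))
print(scrape_url("localhost:8080", metrics_path="/actuator/prometheus"))
```

The second call uses `/actuator/prometheus` purely as an example of a non-default path. In a real configuration that value would go in the job's metrics_path setting.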
3. Given this Prometheus query: rate(http_requests_total[5m]), what does it calculate?
medium
A. The average rate of HTTP requests per second over the last 5 minutes
B. The current number of active HTTP requests
C. The total number of HTTP requests since service start
D. The maximum number of HTTP requests in the last 5 minutes

Solution

  1. Step 1: Understand rate() function

    The rate() function calculates the per-second average increase of a counter over a time window.
  2. Step 2: Apply to http_requests_total[5m]

    This means it measures how fast the total HTTP requests counter increased in the last 5 minutes, giving requests per second.
  3. Final Answer:

    The average rate of HTTP requests per second over the last 5 minutes -> Option A
  4. Quick Check:

    rate() = per-second average increase [OK]
Hint: rate() gives per-second average over time window [OK]
Common Mistakes:
  • Thinking rate() returns total count
  • Confusing rate() with current active requests
  • Assuming rate() returns max value
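
The idea behind rate() can be illustrated with a simplified calculation over the raw samples in a window: add up the increases between consecutive samples and divide by the elapsed seconds. This sketch is an approximation under assumptions. The function name `simple_rate` is hypothetical, a drop in the counter is treated as a restart from zero, and the extrapolation to the window boundaries that the real function performs is left out.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset: count everything since the restart.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Counter sampled once a minute over 5 minutes, growing by 600 each time.
window = [(0, 0), (60, 600), (120, 1200), (180, 1800), (240, 2400), (300, 3000)]
print(simple_rate(window))
```

The counter grows by 3000 over 300 seconds, so the result is a per-second figure rather than a total count or a maximum.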
4. You configured Prometheus to scrape localhost:9090 but no metrics appear. Which fix is correct?
medium
A. Change target to localhost:9090/metrics in YAML
B. Remove job_name from config
C. Restart Prometheus to reload config
D. Add metrics_path: '/metrics' under the scrape job

Solution

  1. Step 1: Understand default metrics path

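    Prometheus scrapes the /metrics path by default. If the service exposes metrics on a different path, the scrape job must set metrics_path to that path.
  2. Step 2: Set the metrics path in the scrape job

    The path belongs in metrics_path under the scrape job, never in targets, which take host:port only. Setting metrics_path: '/metrics' states the default explicitly; use the service's actual path if it differs.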
  3. Final Answer:

    Add metrics_path: '/metrics' under the scrape job -> Option D
  4. Quick Check:

    metrics_path fixes scrape URL [OK]
Hint: Use metrics_path to set correct scrape URL path [OK]
Common Mistakes:
  • Adding path in targets instead of metrics_path
  • Restarting without config fix
  • Removing job_name breaks config
5. You want to monitor error rates in a microservice using Prometheus. The service exposes http_requests_total with labels status and method. Which query shows the error rate (status codes 500-599) over the last 10 minutes as a percentage of all requests?
hard
A. rate(http_requests_total{status=~"5.."}[10m]) / rate(http_requests_total[10m]) * 100
B. sum(rate(http_requests_total{status=~"5.."}[10m])) / sum(rate(http_requests_total[10m])) * 100
C. sum(rate(http_requests_total{status=~"5.."}[10m])) * 100
D. sum(rate(http_requests_total{status!~"5.."}[10m])) / sum(rate(http_requests_total[10m])) * 100

Solution

  1. Step 1: Filter error status codes 500-599

    Use regex status=~"5.." to select error codes in the 500 range.
  2. Step 2: Calculate error rate as percentage

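    Sum the rates of error requests and divide by the sum of the rates of all requests, then multiply by 100 for a percentage. The sum() matters: without it, the division pairs series with identical labels, so each 5xx series is divided by itself.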
  3. Final Answer:

    sum(rate(http_requests_total{status=~"5.."}[10m])) / sum(rate(http_requests_total[10m])) * 100 -> Option B
  4. Quick Check:

    Error rate % = error requests / total requests * 100 [OK]
Hint: Sum rates before division for correct percentage [OK]
Common Mistakes:
  • Dividing single rates instead of sums
  • Using wrong label regex
  • Multiplying before division
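
The arithmetic behind the chosen query can be checked with a small calculation over per-series rates. This is a sketch with made-up numbers. The function name `error_percentage` and the sample rates are hypothetical, and the `5..` pattern is matched against the whole status value, as PromQL regex matchers do.

```python
import re


def error_percentage(rates):
    """Share of 5xx requests, given per-second rates keyed by (status, method)."""
    total = sum(rates.values())
    if total == 0:
        return None  # no traffic, percentage undefined
    errors = sum(rate for (status, _method), rate in rates.items()
                 if re.fullmatch(r"5..", status))
    return errors / total * 100


# Hypothetical per-series request rates (requests per second).
rates = {
    ("200", "GET"): 90.0,
    ("200", "POST"): 8.0,
    ("500", "GET"): 1.5,
    ("503", "POST"): 0.5,
}
print(error_percentage(rates))
```

Here 2 of every 100 requests per second are errors. Dividing each 5xx series by its own rate instead of by the summed total would give 100 for every series, which is why the sums come first.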