
Metrics collection (Prometheus) in Microservices - Scalability & System Analysis

Growth Table: Metrics Collection with Prometheus
| Scale | Number of Services | Metrics Volume | Prometheus Instances | Storage | Network Traffic |
|---|---|---|---|---|---|
| 100 users | 10-20 microservices | ~10k metrics/min | 1 single Prometheus server | A few GB (local disk) | Low (a few MB/s) |
| 10K users | 50-100 microservices | ~1M metrics/min | 1-2 Prometheus servers with federation | 100s of GB (local or network storage) | Moderate (10s of MB/s) |
| 1M users | 200-500 microservices | ~100M metrics/min | Multiple Prometheus servers with sharding and federation | TBs (networked or remote storage) | High (100s of MB/s) |
| 100M users | 1000+ microservices | Billions of metrics/min | Highly distributed Prometheus with remote write to a scalable TSDB (e.g., Cortex, Thanos) | Multiple TBs to PBs (cloud storage) | Very high (Gbps range) |
First Bottleneck

The first bottleneck is the Prometheus server's ability to scrape and store metrics.

At small scale, a single Prometheus instance can handle scraping and storing metrics.

As the number of microservices and metrics grows, the server's CPU, memory, and disk I/O become the limiting factors.

Network bandwidth can also become a bottleneck when scraping many endpoints frequently.
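To make the data flow concrete: each microservice exposes its metrics over HTTP in the Prometheus text exposition format, and the server pulls this payload from every target on each scrape. An illustrative /metrics response (metric names and values here are made up for the example):

```text
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/orders"} 1027
# HELP http_request_duration_seconds Request latency.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 950
http_request_duration_seconds_bucket{le="+Inf"} 1027
http_request_duration_seconds_sum 42.7
http_request_duration_seconds_count 1027
```

Every line here becomes one sample per scrape, which is why payload size and scrape frequency multiply directly into CPU, disk, and network load.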

Scaling Solutions
  • Horizontal scaling: Run multiple Prometheus servers, each scraping a subset of services (sharding).
  • Federation: Use Prometheus federation to aggregate metrics from multiple servers.
  • Remote storage: Offload long-term storage to scalable time-series databases like Cortex or Thanos.
  • Caching and scraping interval tuning: Reduce scrape frequency or cache metrics to reduce load.
  • Network optimization: Use service discovery and scrape targets efficiently to reduce network overhead.
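Sharding can be implemented with Prometheus's built-in hashmod relabeling: every shard runs the same scrape configuration but keeps only the targets whose address hashes to its shard number. A minimal sketch for shard 0 of 3 (the job name and Consul address are placeholders; any service-discovery mechanism works):

```yaml
scrape_configs:
  - job_name: microservices
    consul_sd_configs:            # discover targets via service discovery
      - server: 'consul:8500'
    relabel_configs:
      - source_labels: [__address__]
        modulus: 3                # total number of Prometheus shards
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: '0'                # this server is shard 0; use 1 and 2 on the others
        action: keep
```

Because the hash is deterministic, each target is scraped by exactly one shard, and adding capacity means raising the modulus and deploying another instance.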
Back-of-Envelope Cost Analysis

Assuming 500 microservices, each exposing 100 metrics, scraped every 15 seconds:

  • Metrics per scrape: 500 * 100 = 50,000
  • Scrapes per minute: 60 / 15 = 4
  • Total metrics per minute: 50,000 * 4 = 200,000
  • A single Prometheus server can typically ingest on the order of tens to hundreds of thousands of samples per second, so 200,000 metrics/min (~3,300 samples/sec) is well within one server's capacity.
  • Storage needed: 200,000 metrics/min * 60 min * 24 hr * 30 days ≈ 8.64 billion data points/month; at Prometheus's typical ~1-2 bytes per compressed sample, that is roughly 9-17 GB/month.
  • Network bandwidth: Each metric ~100 bytes, so 200,000 * 100 bytes = ~20 MB/min (~333 KB/s).
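The arithmetic above can be double-checked with a short script (the ~100 bytes per scraped metric is the same working assumption used in the bandwidth estimate):

```python
# Back-of-envelope sizing for Prometheus scraping 500 microservices.
services = 500
metrics_per_service = 100
scrape_interval_s = 15

metrics_per_scrape = services * metrics_per_service      # samples collected per scrape
scrapes_per_min = 60 // scrape_interval_s                # scrapes per target per minute
samples_per_min = metrics_per_scrape * scrapes_per_min   # total samples ingested per minute

samples_per_month = samples_per_min * 60 * 24 * 30       # data points stored per month
bandwidth_kb_s = samples_per_min * 100 / 1000 / 60       # at ~100 bytes per sample

print(f"samples/min:   {samples_per_min:,}")       # 200,000
print(f"samples/month: {samples_per_month:,}")     # 8,640,000,000
print(f"bandwidth:     ~{bandwidth_kb_s:.0f} KB/s")  # ~333 KB/s
```

Running the numbers this way makes it easy to re-derive the estimates when an interviewer changes one input, such as halving the scrape interval.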
Interview Tip

Start by explaining the data flow: how Prometheus scrapes metrics from microservices.

Discuss the limits of a single Prometheus server and identify bottlenecks.

Then propose scaling strategies like sharding, federation, and remote storage.

Highlight trade-offs such as complexity vs. scalability.

Self Check

Your Prometheus server handles 1000 queries per second (QPS). Traffic grows 10x. What do you do first?

Answer: Introduce horizontal scaling by splitting scrape targets across multiple Prometheus instances (sharding) and use federation to aggregate metrics.
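The federation half of that answer is configured as an ordinary scrape job against each shard's /federate endpoint, pulling only the series selected by the match[] parameters. A minimal sketch (shard hostnames and the match expression are placeholders):

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # pull only pre-aggregated recording-rule series
    static_configs:
      - targets:
          - 'prometheus-shard-0:9090'
          - 'prometheus-shard-1:9090'
          - 'prometheus-shard-2:9090'
```

Federating only aggregated series keeps the global server small; pulling every raw series from every shard would simply recreate the original bottleneck one level up.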

Key Result
Prometheus scales well initially but hits CPU, memory, and storage limits as metrics grow; horizontal sharding and remote storage are key to scaling beyond millions of metrics.