| Scale | Number of Services | Metrics Volume | Prometheus Instances | Storage | Network Traffic |
|---|---|---|---|---|---|
| 100 users | 10-20 microservices | ~10k metrics/min | 1 Prometheus server | Few GBs (local disk) | Low (few MB/s) |
| 10K users | 50-100 microservices | ~1M metrics/min | 1-2 Prometheus servers with federation | 100s GBs (local or network storage) | Moderate (10s MB/s) |
| 1M users | 200-500 microservices | ~100M metrics/min | Multiple Prometheus servers with sharding and federation | TBs (networked storage or remote storage) | High (100s MB/s) |
| 100M users | 1000+ microservices | Billions of metrics/min | Highly distributed Prometheus with remote write to scalable TSDB (e.g., Cortex, Thanos) | Multiple TBs to PBs (cloud storage) | Very high (Gbps range) |
Metrics collection (Prometheus) in Microservices - Scalability & System Analysis
The first bottleneck is the Prometheus server's ability to scrape and store metrics.
At small scale, a single Prometheus instance can handle scraping and storing metrics.
As the number of microservices and metrics grows, the server's CPU, memory, and disk I/O become saturated.
Network bandwidth can also become a bottleneck when scraping many endpoints frequently.
- Horizontal scaling: Run multiple Prometheus servers, each scraping a subset of services (sharding).
- Federation: Use Prometheus federation to aggregate metrics from multiple servers.
- Remote storage: Offload long-term storage to horizontally scalable systems such as Cortex or Thanos.
- Caching and scrape interval tuning: Scrape less frequently, or cache metrics that are expensive to compute, to reduce load.
- Network optimization: Use service discovery so that each server scrapes only the targets it needs, reducing network overhead.
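The sharding idea in the list above can be sketched as a deterministic hash of each target's address, taken modulo the number of Prometheus servers. This is similar in spirit to Prometheus's `hashmod` relabel action, but it is not bit-compatible with it. The target names, the shard count, and the helper functions `shard_for` and `assign_targets` are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict


def shard_for(target: str, num_shards: int) -> int:
    """Map a scrape target to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    # Use the last 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[-8:], "big") % num_shards


def assign_targets(targets, num_shards):
    """Group targets by the shard that should scrape them."""
    shards = defaultdict(list)
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return dict(shards)


# Hypothetical targets: 500 services, split across 4 Prometheus servers.
targets = [f"svc-{i}.example.internal:9100" for i in range(500)]
assignment = assign_targets(targets, num_shards=4)

for shard, members in sorted(assignment.items()):
    print(f"shard {shard}: {len(members)} targets")
```

Because the mapping depends only on the target name and the shard count, every server can compute it independently. The trade-off is that changing the shard count moves many targets to a different server.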
Assuming 500 microservices, each exposing 100 metrics, scraped every 15 seconds:
- Metrics per scrape: 500 * 100 = 50,000
- Scrapes per minute: 60 / 15 = 4
- Total metrics per minute: 50,000 * 4 = 200,000
- Prometheus can handle ~10,000-50,000 metrics per second on a single server, so this load (200,000 / 60 ≈ 3,333 metrics per second) fits on one server.
- Storage needed: 200,000 metrics/min * 60 min * 24 hr * 30 days ≈ 8.64 billion data points/month.
- Network bandwidth: Each metric ~100 bytes, so 200,000 * 100 bytes = ~20 MB/min (~333 KB/s).
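The arithmetic above can be reproduced with a short script, which makes it easy to rerun the estimate with different inputs. The constants come from the assumptions stated in this section. The variable names are illustrative, and the 100-byte figure is the rough per-metric size used above, not a measured value.

```python
# Inputs taken from the estimate in this section.
SERVICES = 500
METRICS_PER_SERVICE = 100
SCRAPE_INTERVAL_S = 15
BYTES_PER_SAMPLE = 100   # rough per-metric size assumed above
RETENTION_DAYS = 30

samples_per_scrape = SERVICES * METRICS_PER_SERVICE            # 50,000
scrapes_per_minute = 60 // SCRAPE_INTERVAL_S                   # 4
samples_per_minute = samples_per_scrape * scrapes_per_minute   # 200,000
samples_per_second = samples_per_minute / 60                   # about 3,333

# Data points accumulated over the retention window.
samples_per_month = samples_per_minute * 60 * 24 * RETENTION_DAYS

# Scrape traffic, in bytes per minute and kilobytes per second.
bytes_per_minute = samples_per_minute * BYTES_PER_SAMPLE
kb_per_second = bytes_per_minute / 60 / 1000

print(f"samples/min:   {samples_per_minute:,}")
print(f"samples/s:     {samples_per_second:,.0f}")
print(f"samples/month: {samples_per_month:,}")
print(f"network:       {bytes_per_minute / 1e6:.0f} MB/min ({kb_per_second:.0f} KB/s)")
```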
Start by explaining the data flow: how Prometheus scrapes metrics from microservices.
Discuss the limits of a single Prometheus server and identify bottlenecks.
Then propose scaling strategies like sharding, federation, and remote storage.
Highlight trade-offs such as complexity vs. scalability.
Your Prometheus server handles 1000 queries per second (QPS). Traffic grows 10x. What do you do first?
Answer: Introduce horizontal scaling by splitting scrape targets across multiple Prometheus instances (sharding) and use federation to aggregate metrics.
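As a rough planning aid for the sharding answer, the number of Prometheus instances can be estimated from the ingest rate and a per-server capacity. This is a minimal sketch, not a sizing rule: the function `shards_needed` and the 50% headroom factor are assumptions, and the capacity figure is the low end of the single-server estimate quoted earlier in this section.

```python
import math


def shards_needed(samples_per_second: float,
                  per_server_capacity: float,
                  headroom: float = 0.5) -> int:
    """Estimate how many servers are needed to absorb a given ingest rate.

    headroom is the fraction of a server's capacity we are willing to use
    (an assumed safety margin, not a Prometheus setting).
    """
    usable = per_server_capacity * headroom
    return max(1, math.ceil(samples_per_second / usable))


current_rate = 200_000 / 60        # samples/s from the estimate above
grown_rate = current_rate * 10     # the 10x growth scenario
capacity = 10_000                  # low end of the quoted single-server range

print("shards today:     ", shards_needed(current_rate, capacity))
print("shards after 10x: ", shards_needed(grown_rate, capacity))
```

Under these assumptions one server is enough today and seven are needed after 10x growth. Real capacity depends heavily on hardware, label cardinality, and query load.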