| Scale | Number of Services | Metrics Volume | Prometheus Instances | Storage | Network Traffic |
|---|---|---|---|---|---|
| 100 users | 10-20 microservices | ~10k metrics/min | 1 single Prometheus server | Few GBs (local disk) | Low (few MB/s) |
| 10K users | 50-100 microservices | ~1M metrics/min | 1-2 Prometheus servers with federation | 100s GBs (local or network storage) | Moderate (10s MB/s) |
| 1M users | 200-500 microservices | ~100M metrics/min | Multiple Prometheus servers with sharding and federation | TBs (networked storage or remote storage) | High (100s MB/s) |
| 100M users | 1000+ microservices | Billions of metrics/min | Highly distributed Prometheus with remote write to scalable TSDB (e.g., Cortex, Thanos) | Multiple TBs to PBs (cloud storage) | Very high (Gbps range) |
Metrics collection (Prometheus) in Microservices - Scalability & System Analysis
The first bottleneck is the Prometheus server's ability to scrape and store metrics.
At small scale, a single Prometheus instance can handle scraping and storing metrics.
As the number of microservices and metrics grows, the server CPU, memory, and disk I/O become overwhelmed.
Network bandwidth can also become a bottleneck when scraping many endpoints frequently.
- Horizontal scaling: Run multiple Prometheus servers, each scraping a subset of services (sharding).
- Federation: Use Prometheus federation to aggregate metrics from multiple servers.
- Remote storage: Offload long-term storage to scalable time-series databases like Cortex or Thanos.
- Caching and scraping interval tuning: Reduce scrape frequency or cache metrics to reduce load.
- Network optimization: Use service discovery and scrape targets efficiently to reduce network overhead.
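The sharding idea above can be sketched in a few lines. This is a minimal illustration, not Prometheus's own implementation: Prometheus offers a `hashmod` relabel action for the same purpose, but the hash used here (MD5 of the address, last 8 bytes) and the target hostnames are assumptions chosen for the example.

```python
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Map a scrape target to a shard index using a stable hash."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Use the last 8 bytes as an integer, then take it modulo the shard count.
    return int.from_bytes(digest[-8:], "big") % num_shards


def assign_targets(targets, num_shards):
    """Group targets by the shard (Prometheus instance) that should scrape them."""
    shards = {i: [] for i in range(num_shards)}
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return shards


# Hypothetical service addresses, for illustration only.
targets = [f"svc-{i}.internal:8080" for i in range(12)]
shards = assign_targets(targets, 3)
for index, members in shards.items():
    print(index, members)
```

Because the hash depends only on the target address, every instance can compute the same assignment independently, and each target is scraped by exactly one shard.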
Assuming 500 microservices, each exposing 100 metrics, scraped every 15 seconds:
- Metrics per scrape: 500 * 100 = 50,000
- Scrapes per minute: 60 / 15 = 4
- Total metrics per minute: 50,000 * 4 = 200,000
- A single Prometheus server can handle roughly 10,000-50,000 metrics per second, so 200,000 per minute (about 3,333 per second) fits comfortably on one server.
- Storage needed: 200,000 metrics/min * 60 min * 24 hr * 30 days ≈ 8.64 billion data points/month.
- Network bandwidth: Each metric ~100 bytes, so 200,000 * 100 bytes = ~20 MB/min (~333 KB/s).
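The estimate above can be reproduced with a short script, which makes it easy to plug in other service counts or scrape intervals. The inputs are the ones assumed in the text (500 services, 100 metrics each, 15-second interval, about 100 bytes per metric); the variable names are illustrative. With these inputs the monthly total works out to about 8.64 billion data points.

```python
services = 500
metrics_per_service = 100
scrape_interval_s = 15
bytes_per_sample = 100  # rough on-the-wire size assumed in the text

samples_per_cycle = services * metrics_per_service           # one scrape of every service
cycles_per_minute = 60 // scrape_interval_s
samples_per_minute = samples_per_cycle * cycles_per_minute
samples_per_second = samples_per_minute / 60
samples_per_month = samples_per_minute * 60 * 24 * 30        # 30-day month
bytes_per_minute = samples_per_minute * bytes_per_sample

print(f"samples per cycle:  {samples_per_cycle:,}")
print(f"samples per minute: {samples_per_minute:,}")
print(f"samples per second: {samples_per_second:,.0f}")
print(f"samples per month:  {samples_per_month:,}")
print(f"network:            {bytes_per_minute / 1e6:.0f} MB/min "
      f"({bytes_per_minute / 60 / 1e3:.0f} KB/s)")
```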
Start by explaining the data flow: how Prometheus scrapes metrics from microservices.
Discuss the limits of a single Prometheus server and identify bottlenecks.
Then propose scaling strategies like sharding, federation, and remote storage.
Highlight trade-offs such as complexity vs. scalability.
Your Prometheus server handles 1000 queries per second (QPS). Traffic grows 10x. What do you do first?
Answer: Introduce horizontal scaling by splitting scrape targets across multiple Prometheus instances (sharding) and use federation to aggregate metrics.
Practice
Solution
Step 1: Understand Prometheus role
Prometheus is designed to collect numerical data called metrics from running services.
Step 2: Identify monitoring purpose
These metrics help monitor service health and performance in microservices.
Final Answer:
To collect and store metrics from services for monitoring -> Option A
Quick Check:
Prometheus = Metrics collection [OK]
- Confusing Prometheus with deployment tools
- Thinking Prometheus manages users
- Assuming Prometheus serves web content
http://localhost:8080/metrics?
Solution
Step 1: Check Prometheus YAML syntax
Prometheus uses `scrape_configs` with `job_name` and `static_configs` listing `targets` as host:port without a URL path.
Step 2: Validate target format
Targets must be host:port only, with no http:// prefix and no path such as /metrics.
Final Answer:
scrape_configs: - job_name: 'myservice' static_configs: - targets: ['localhost:8080'] -> Option C
Quick Check:
Targets = host:port only [OK]
- Including http:// or /metrics in targets
- Using wrong YAML keys like scrape_jobs or jobs
- Misnaming job_name or static_configs
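The host:port rule can be checked mechanically before a config is deployed. The helper below is a simplified sketch, not part of Prometheus: the name `is_valid_target` and the regular expression are assumptions, and the check deliberately ignores cases such as IPv6 literals.

```python
import re

# Hostname or IPv4 address, a colon, then a 1-5 digit port. Simplified on purpose.
_TARGET = re.compile(r"[A-Za-z0-9.-]+:(\d{1,5})")


def is_valid_target(target: str) -> bool:
    """Return True if target looks like host:port with no scheme and no path."""
    match = _TARGET.fullmatch(target)
    return match is not None and 0 < int(match.group(1)) <= 65535


for candidate in ["localhost:8080",
                  "http://localhost:8080",
                  "localhost:8080/metrics"]:
    print(candidate, is_valid_target(candidate))
```

Only the first candidate passes; the other two reproduce the common mistakes listed above.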
rate(http_requests_total[5m]), what does it calculate?
Solution
Step 1: Understand the `rate()` function
The `rate()` function calculates the per-second average increase of a counter over a time window.
Step 2: Apply to `http_requests_total[5m]`
This means it measures how fast the total HTTP requests counter increased in the last 5 minutes, giving requests per second.
Final Answer:
The average rate of HTTP requests per second over the last 5 minutes -> Option A
Quick Check:
rate() = per-second average increase [OK]
- Thinking rate() returns total count
- Confusing rate() with current active requests
- Assuming rate() returns max value
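To make the "per-second average increase" concrete, here is a simplified model of the calculation. It is a sketch under stated assumptions: the function name `simple_rate` and the sample data are invented for illustration, and the real `rate()` additionally extrapolates to the edges of the window, which this version does not.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), sorted by time,
    all falling inside the query window.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter went backwards, so treat it as a reset to zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical counter readings taken once a minute over 5 minutes.
window = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 340), (300, 400)]
print(simple_rate(window))  # 300 requests over 300 seconds -> 1.0
```

The result is a rate, not a count: the counter grew by 300, but the function reports 1 request per second.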
localhost:9090 but no metrics appear. Which fix is correct?
Solution
Step 1: Understand default metrics path
Prometheus scrapes the `/metrics` path by default; if the service exposes metrics on a different path, you must specify it.
Step 2: Fix missing metrics path
Adding `metrics_path: '/metrics'` explicitly tells Prometheus where to get metrics. Because `/metrics` is already the default, this only confirms the path; if the service exposes metrics elsewhere, set `metrics_path` to that path instead.
Final Answer:
Add `metrics_path: '/metrics'` under the scrape job -> Option D
Quick Check:
metrics_path fixes scrape URL [OK]
- Adding path in targets instead of metrics_path
- Restarting without config fix
- Removing job_name breaks config
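The relationship between `targets` and `metrics_path` is easier to see when the scrape URL is assembled by hand. The function below is a hypothetical helper, not Prometheus code; it only illustrates that the path belongs in `metrics_path` and is appended to the host:port target. The non-default path in the usage example is purely illustrative.

```python
def scrape_url(target: str, metrics_path: str = "/metrics", scheme: str = "http") -> str:
    """Build the URL that would be requested for a host:port target."""
    if "/" in target:
        # Catches both "http://host:port" and "host:port/metrics".
        raise ValueError("target must be host:port; put the path in metrics_path")
    if not metrics_path.startswith("/"):
        metrics_path = "/" + metrics_path
    return f"{scheme}://{target}{metrics_path}"


print(scrape_url("localhost:9090"))
print(scrape_url("localhost:9090", metrics_path="/actuator/prometheus"))
```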
http_requests_total with labels status and method. Which query shows the error rate (status codes 500-599) over the last 10 minutes as a percentage of all requests?
Solution
Step 1: Filter error status codes 500-599
Use the regex matcher `status=~"5.."` to select error codes in the 500 range.
Step 2: Calculate error rate as percentage
Sum the rate of error requests, divide by the sum of the rate of all requests, then multiply by 100 to get a percentage.
Final Answer:
`sum(rate(http_requests_total{status=~"5.."}[10m])) / sum(rate(http_requests_total[10m])) * 100` -> Option B
Quick Check:
Error rate % = error requests / total requests * 100 [OK]
- Dividing single rates instead of sums
- Using wrong label regex
- Multiplying before division
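The same arithmetic can be traced outside PromQL. In this sketch the per-series rates are made-up numbers keyed by (status, method), standing in for what `rate(http_requests_total[10m])` would return for each series; the function name is hypothetical.

```python
import re


def error_percentage(rates_by_series):
    """Share of 5xx traffic, mirroring sum(errors) / sum(all) * 100."""
    total = sum(rates_by_series.values())
    if total == 0:
        return None  # no traffic, so the ratio is undefined
    errors = sum(rate for (status, _method), rate in rates_by_series.items()
                 if re.fullmatch(r"5..", status))
    return errors / total * 100


# Hypothetical per-second rates for each (status, method) series.
rates = {
    ("200", "GET"): 90.0,
    ("200", "POST"): 5.0,
    ("404", "GET"): 3.0,
    ("500", "GET"): 1.5,
    ("503", "POST"): 0.5,
}
print(error_percentage(rates))  # roughly 2 percent for this sample data
```

Summing before dividing matters: dividing individual series would yield one ratio per label combination rather than a single overall percentage.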
