| Users / Services | 100 Services | 10,000 Services | 1,000,000 Services | 100,000,000 Services |
|---|---|---|---|---|
| Health Check Requests per Second | ~100-500 req/s | ~10,000-50,000 req/s | ~1,000,000-5,000,000 req/s | ~100,000,000-500,000,000 req/s |
| Monitoring System Load | Single monitoring server can handle | Requires distributed monitoring clusters | Needs hierarchical monitoring with aggregation | Global distributed monitoring with regional aggregation |
| Network Bandwidth | Low, manageable on standard network | Moderate, requires optimized network | High, needs dedicated network infrastructure | Very high, requires CDN and edge computing |
| Data Storage for Logs | Small, local storage sufficient | Medium, needs centralized log storage | Large, requires scalable storage solutions | Massive, needs tiered and archival storage |
| Alerting Approach | Manual or simple automated alerts | Automated alerts with thresholds | AI-assisted anomaly detection | Advanced predictive analytics and automation |
Health check pattern in Microservices - Scalability & System Analysis
The first bottleneck is the monitoring system's ability to process and aggregate health check requests as the number of services grows.
At small scale, a single monitoring server can poll all services easily.
At medium scale (~10,000 services), a single monitoring server's CPU and network bandwidth become saturated.
At large scale, the volume of health check data overwhelms storage and network, causing delays and missed alerts.
- Horizontal Scaling: Add multiple monitoring servers to distribute health check load.
- Hierarchical Health Checks: Use local aggregators to collect health data from a subset of services, then forward summaries upstream.
- Adaptive Health Check Frequency: Reduce check frequency for stable services to lower load.
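- Caching and Event-Driven Checks: Cache the last known status of each service and let services report status changes through events instead of being polled constantly.
- Efficient Protocols: Use low-overhead mechanisms, such as gRPC health checks over persistent connections or UDP heartbeats, to reduce the cost of each check.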
- Data Storage Optimization: Archive old health data and use tiered storage to manage volume.
- Network Optimization: Use edge monitoring and CDNs to reduce network load.
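Two of the techniques above, hierarchical aggregation and adaptive check frequency, can be sketched in a few lines. This is a minimal illustration, not a production design: the names `RegionSummary`, `summarize`, and `next_interval`, the doubling back-off policy, and the 10 s / 120 s bounds are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RegionSummary:
    """Compact report a local aggregator forwards upstream (hypothetical shape)."""
    region: str
    total: int
    unhealthy: list[str]


def summarize(region: str, statuses: dict[str, bool]) -> RegionSummary:
    """Collapse per-service results into one summary per region.

    Upstream receives one small message instead of one message per service.
    """
    unhealthy = sorted(name for name, ok in statuses.items() if not ok)
    return RegionSummary(region=region, total=len(statuses), unhealthy=unhealthy)


def next_interval(current_s: float, healthy: bool,
                  base_s: float = 10.0, max_s: float = 120.0) -> float:
    """Adaptive frequency: back off while healthy, snap back on failure.

    Assumed policy: double the interval after each healthy check (capped at
    max_s) and reset to base_s as soon as a check fails.
    """
    if not healthy:
        return base_s
    return min(current_s * 2, max_s)


if __name__ == "__main__":
    summary = summarize("eu-west", {"cart": True, "auth": False, "search": True})
    print(summary)
    print(next_interval(10.0, healthy=True))
    print(next_interval(120.0, healthy=False))
```

The trade-off in the adaptive policy is detection latency: a service that has been stable for a long time is checked only every `max_s` seconds, so a failure may take that long to be noticed unless the service also pushes status changes.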
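Assuming each health check request is ~1 KB (the table above assumes roughly 1-5 checks per service per second; the estimates below use a lighter schedule of 1 check per 10 seconds):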
- At 10,000 services, with 1 check per 10 seconds: 1,000 req/s -> ~1 MB/s bandwidth.
- At 1,000,000 services, same frequency: 100,000 req/s -> ~100 MB/s bandwidth.
- Storage for logs: If storing 1 month of health data at 1 KB per check, 1,000,000 services checked every 10 seconds -> ~259 TB/month.
- Monitoring servers: If each server handles ~5,000 health checks per second, 100,000 req/s needs 20 servers, before adding headroom for failover.
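The arithmetic behind these estimates can be reproduced with a short script. It is a sketch under the same assumptions as the list above (1 KB treated as 1,000 bytes, one check per service every 10 seconds, 30-day retention, ~5,000 checks per second per server); the function name `estimate` and its parameters are made up for this example.

```python
import math


def estimate(services: int,
             interval_s: float = 10.0,
             payload_bytes: int = 1_000,
             per_server_rps: int = 5_000,
             retention_days: int = 30) -> dict:
    """Back-of-the-envelope load estimate for a polling health-check system."""
    rps = services / interval_s                      # checks per second
    bandwidth_mb_s = rps * payload_bytes / 1e6       # MB/s on the wire
    seconds = 86_400 * retention_days                # seconds retained
    storage_tb = rps * payload_bytes * seconds / 1e12
    servers = math.ceil(rps / per_server_rps)        # no failover headroom
    return {
        "rps": rps,
        "bandwidth_mb_s": bandwidth_mb_s,
        "storage_tb": storage_tb,
        "servers": servers,
    }


if __name__ == "__main__":
    for n in (10_000, 1_000_000):
        print(n, estimate(n))
```

Changing `interval_s` shows how strongly check frequency drives every other number: halving the interval doubles request rate, bandwidth, storage, and server count.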
Start by explaining the health check pattern and its purpose.
Discuss how load grows with the number of services and the check frequency.
Identify the monitoring system as the first bottleneck.
Propose scaling solutions step-by-step: horizontal scaling, aggregation, adaptive checks.
Use numbers to justify your approach and show understanding of trade-offs.
Your monitoring database handles 1,000 QPS for health checks. Traffic grows 10x to 10,000 QPS. What do you do first?
Answer: Scale horizontally first by adding read replicas or more monitoring servers to spread the load, then introduce aggregation layers so that fewer queries reach the database directly.