| Metric | 100 Users | 10,000 Users | 1 Million Users | 100 Million Users |
|---|---|---|---|---|
| Dashboard Views per Second | ~10-50 | ~1,000-5,000 | ~100,000 | ~10,000,000+ |
| Data Source Queries per Second | ~100-500 | ~10,000-50,000 | ~1,000,000+ | ~100,000,000+ |
| Grafana Servers Needed | 1-2 | 10-20 | 200-300 | Thousands (Cloud scale) |
| Database Load (Metrics DB) | Low | Moderate | High - requires sharding | Very High - multi-region sharding |
| Cache Usage | Minimal | Important for performance | Critical - aggressive caching | Essential - multi-layer caching |
| Network Bandwidth | Low | Moderate | High | Very High - CDN and edge needed |
Dashboards (Grafana) in Microservices - Scalability & System Analysis
The first bottleneck is the metrics database, which stores the time-series data and serves every query issued by Grafana dashboards. At low scale, the database handles these queries easily. As users and dashboards grow, query volume grows with them: each dashboard view fans out into several data source queries, so database load rises faster than the view count. The result is slow responses and timeouts, because a time-series database has finite query throughput and storage I/O.
- Read Replicas: Add replicas of the metrics database to distribute read queries.
- Caching: Use in-memory caches (e.g., Redis) to store frequent query results and reduce DB load.
- Sharding: Partition metrics data by time or tenant to spread load across multiple DB instances.
- Horizontal Scaling: Add more Grafana servers behind a load balancer to handle more dashboard requests.
- CDN and Edge Caching: Cache static dashboard assets and some query results closer to users to reduce latency and bandwidth.
- Query Optimization: Limit dashboard refresh rates and optimize queries to reduce expensive DB operations.
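
The caching idea above can be sketched as a small time-to-live (TTL) cache placed in front of the metrics query function. This is a minimal in-process illustration, not Grafana's or Redis's API: the class name `TTLQueryCache`, the `fetch` callback, and the 30-second TTL are all assumptions chosen to match the refresh interval used in the estimate. In a real deployment the dictionary would typically be replaced by a shared store such as Redis.

```python
import time


class TTLQueryCache:
    """Caches query results for a fixed time-to-live (hypothetical helper)."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch          # function that actually queries the metrics DB
        self._ttl = ttl_seconds      # how long a cached result stays valid
        self._clock = clock          # injectable clock, useful for testing
        self._entries = {}           # query string -> (expires_at, result)
        self.hits = 0
        self.misses = 0

    def get(self, query):
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and entry[0] > now:
            # Fresh entry: serve from cache, no database round trip.
            self.hits += 1
            return entry[1]
        # Missing or expired: query the database and store the result.
        self.misses += 1
        result = self._fetch(query)
        self._entries[query] = (now + self._ttl, result)
        return result
```

When many users view the same dashboard, identical queries that arrive within one TTL window reach the database only once, which is why caching becomes more effective as the user count grows.
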
Assuming 10,000 users with 5 dashboards each refreshing every 30 seconds:
- Dashboard views per second = (10,000 users × 5 dashboards) / 30 s ≈ 1,667 views/s
- Each dashboard triggers ~5 queries → DB queries ≈ 1,667 × 5 ≈ 8,333 QPS
- Storage: Metrics data grows ~1 GB per day per 1,000 users → ~10 GB/day for 10,000 users
- Network bandwidth: Dashboard data + assets ~100 KB per view × 1,667 views/s ≈ 167 MB/s outgoing bandwidth
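
The same arithmetic can be written as a short script, which makes it easy to rerun with different inputs. This is a sketch under the assumptions stated in the estimate (5 dashboards per user, a 30-second refresh, ~5 queries per dashboard, ~100 KB per view, ~1 GB/day per 1,000 users). The function name `estimate_load` and its parameter names are made up for illustration.

```python
def estimate_load(users, dashboards_per_user=5, refresh_seconds=30,
                  queries_per_dashboard=5, kb_per_view=100,
                  gb_per_day_per_1000_users=1.0):
    """Back-of-envelope load estimate for a Grafana-style dashboard setup."""
    # Every open dashboard refreshes once per refresh interval.
    views_per_s = users * dashboards_per_user / refresh_seconds
    # Each dashboard view fans out into several data source queries.
    db_qps = views_per_s * queries_per_dashboard
    # Decimal units: 1 MB = 1,000 KB.
    bandwidth_mb_s = views_per_s * kb_per_view / 1000
    storage_gb_day = users / 1000 * gb_per_day_per_1000_users
    return {
        "views_per_s": views_per_s,
        "db_qps": db_qps,
        "bandwidth_mb_s": bandwidth_mb_s,
        "storage_gb_day": storage_gb_day,
    }


if __name__ == "__main__":
    for name, value in estimate_load(10_000).items():
        print(f"{name}: {value:,.0f}")
```

Doubling the refresh interval to 60 seconds halves views, queries, and bandwidth in this model, which is why limiting refresh rates is listed among the scaling measures.
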
Start by identifying the main components: Grafana servers, metrics database, caching layers, and network. Discuss how user growth increases dashboard views and DB queries. Highlight the database as the first bottleneck and propose solutions like read replicas and caching. Mention horizontal scaling of Grafana servers and CDN for static assets. Always quantify load and explain trade-offs clearly.
Your metrics database handles 1,000 queries per second (QPS). Traffic grows 10x to 10,000 QPS. What do you do first and why?
Answer: Add read replicas to distribute the increased read query load and implement caching for frequent queries to reduce direct database hits. This addresses the immediate bottleneck without major redesign.
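
To put rough numbers on that answer, the sketch below estimates how many read-serving database nodes (primary plus replicas) are needed once a cache absorbs part of the traffic. It assumes each node sustains the same 1,000 QPS as the original database and that the cache hit rate is known. Both are simplifying assumptions, and `read_nodes_needed` is a hypothetical helper, not part of any library.

```python
def read_nodes_needed(total_qps, per_node_qps, cache_hit_percent=0):
    """Number of read-serving DB nodes needed after caching (rough estimate)."""
    if not 0 <= cache_hit_percent <= 100:
        raise ValueError("cache_hit_percent must be between 0 and 100")
    # Queries that miss the cache and reach the database, scaled by 100
    # so the whole calculation stays in integer arithmetic.
    db_qps_x100 = total_qps * (100 - cache_hit_percent)
    # Integer ceiling division; always keep at least one node.
    return max(1, -(-db_qps_x100 // (per_node_qps * 100)))


if __name__ == "__main__":
    print(read_nodes_needed(10_000, 1_000))                        # no cache
    print(read_nodes_needed(10_000, 1_000, cache_hit_percent=80))  # with cache
```

Under these assumptions, 10,000 QPS with no cache needs 10 nodes, while an 80% cache hit rate cuts database traffic to 2,000 QPS and the requirement to 2 nodes. The trade-off is that cached results can be up to one TTL stale.
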