| Scale | What Changes? |
|---|---|
| 100 users | Basic SLIs measured manually or with simple tools; SLOs set loosely; SLAs informal or simple. |
| 10,000 users | Automated monitoring for SLIs; SLOs become formal targets; SLAs documented with penalties; alerting systems introduced. |
| 1,000,000 users | High precision SLIs with real-time dashboards; strict SLOs with error budgets; SLAs legally binding; multi-region monitoring. |
| 100,000,000 users | Distributed SLIs across services; dynamic SLO adjustments; SLAs include complex compliance; advanced anomaly detection. |
SLA, SLO, and SLI definitions in HLD - Scalability & System Analysis
At small scale, the main bottleneck is the lack of precise measurement tools for SLIs, which leaves SLOs vague and SLAs hard to enforce.
As scale grows, bottlenecks shift to monitoring infrastructure capacity and data aggregation delays, making real-time SLI tracking difficult.
- Automate Monitoring: Use scalable monitoring tools (e.g., Prometheus, Datadog) to collect SLIs efficiently.
- Distributed Tracing: Implement tracing to measure SLIs across microservices.
- Alerting and Error Budgets: Use SLO-based alerting, triggered by how fast the error budget is being consumed rather than by every individual failure, to focus on meaningful issues.
- Data Aggregation: Use time-series databases and aggregation to handle large volumes of SLI data.
- Legal and Compliance: Scale SLA management with contract automation tools.
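The SLO-based alerting item in the list above can be made concrete with a small calculation. The following is a minimal sketch, not a definitive implementation: the `SloWindow` type, the function names, and the default paging threshold of 10 are all illustrative assumptions, and a real system would read the counts from its monitoring backend.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability_sli(window: SloWindow) -> float:
    """SLI: fraction of requests in the window that succeeded."""
    if window.total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return 1 - window.failed_requests / window.total_requests


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Speed of error-budget consumption; 1.0 means exactly on budget."""
    allowed_error_ratio = 1 - slo_target  # the error budget as a ratio
    observed_error_ratio = 1 - availability_sli(window)
    return observed_error_ratio / allowed_error_ratio


def should_page(window: SloWindow, slo_target: float,
                threshold: float = 10.0) -> bool:
    """Page only when the budget burns much faster than planned.

    The threshold is an assumed value, to be tuned per service.
    """
    return burn_rate(window, slo_target) >= threshold


if __name__ == "__main__":
    window = SloWindow(total_requests=100_000, failed_requests=500)
    print(f"SLI: {availability_sli(window):.4f}")
    print(f"Burn rate vs a 99.9% SLO: {burn_rate(window, 0.999):.1f}x")
    print(f"Page on-call: {should_page(window, 0.999)}")
```

With a 99.9% SLO, a window with 0.5% failures burns the budget about five times faster than planned. Under the assumed threshold that is worth a ticket but not a page, which is the sense in which SLO-based alerting filters out noise.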
For 1 million users each generating 100 requests per second (RPS), i.e. 100 million RPS in total:
- Monitoring system must collect and aggregate metrics for 100 million RPS.
- Storage for SLI data: assuming 10 metrics per request, 100 million RPS * 10 metrics * 86,400 seconds/day ≈ 86 trillion data points/day.
- Bandwidth for monitoring data: at ~10 KB per raw metric point, ≈ 860 PB/day if nothing is aggregated before shipping.
- Costs grow with retention period and granularity of SLIs.
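The estimate above can be reproduced with a few lines of arithmetic. This is a sketch under the same assumptions as the text (10 metrics per request, ~10 KB per point, decimal units). The function name `estimate_sli_volume` and the `retention_days` parameter are illustrative additions, included to show how retention multiplies storage.

```python
SECONDS_PER_DAY = 86_400


def estimate_sli_volume(users: int, rps_per_user: int,
                        metrics_per_request: int, bytes_per_point: int,
                        retention_days: int = 1) -> dict:
    """Back-of-envelope SLI data volume with no aggregation applied."""
    total_rps = users * rps_per_user
    points_per_day = total_rps * metrics_per_request * SECONDS_PER_DAY
    bytes_per_day = points_per_day * bytes_per_point
    petabytes_per_day = bytes_per_day / 1e15  # decimal petabytes
    return {
        "total_rps": total_rps,
        "points_per_day": points_per_day,
        "petabytes_per_day": petabytes_per_day,
        "petabytes_stored": petabytes_per_day * retention_days,
    }


if __name__ == "__main__":
    volume = estimate_sli_volume(
        users=1_000_000,
        rps_per_user=100,
        metrics_per_request=10,
        bytes_per_point=10_000,  # ~10 KB per point, as assumed in the text
        retention_days=30,       # hypothetical retention period
    )
    for name, value in volume.items():
        print(f"{name}: {value:,}")
```

Changing `metrics_per_request`, `bytes_per_point`, or `retention_days` shows which assumption dominates the cost, which is usually the point of doing the estimate in an interview.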
When discussing SLA, SLO, and SLI scalability, start by defining each term clearly.
Explain how measurement precision and monitoring infrastructure evolve with scale.
Discuss bottlenecks in data collection and aggregation, then propose automation and distributed monitoring solutions.
Use real numbers to show understanding of volume and cost impact.
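Your database handles 1,000 QPS for storing SLI data. Traffic grows 10x to 10,000 QPS. What do you do first?
Answer: The workload is write-heavy, so first reduce the write load by batching and pre-aggregating SLI data points before they reach the database. If that is not enough, partition (shard) the database to spread writes. Caching and read replicas help only if dashboard or query reads are also a bottleneck.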
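The aggregation step in the answer above can be sketched as an in-memory buffer that collapses many raw SLI points into one row per metric and time window before writing. This is a minimal illustration, not a production design: the `SliAggregator` class, its method names, and the 10-second window are assumptions, and the actual database write is left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SliAggregator:
    """Buffers raw SLI points and emits one row per (metric, time window)."""

    def __init__(self, window_seconds: int = 10) -> None:
        self.window_seconds = window_seconds
        # (metric name, window start) -> [point count, sum of values]
        self._buckets: Dict[Tuple[str, int], List[float]] = defaultdict(
            lambda: [0, 0.0]
        )

    def record(self, metric: str, timestamp: float, value: float) -> None:
        """Add one raw data point to its time bucket (no database write)."""
        window_start = int(timestamp // self.window_seconds) * self.window_seconds
        bucket = self._buckets[(metric, window_start)]
        bucket[0] += 1
        bucket[1] += value

    def flush(self) -> List[Tuple[str, int, int, float]]:
        """Return aggregated rows (metric, window start, count, sum) and reset."""
        rows = [
            (metric, start, int(count), total)
            for (metric, start), (count, total) in sorted(self._buckets.items())
        ]
        self._buckets.clear()
        return rows


if __name__ == "__main__":
    aggregator = SliAggregator(window_seconds=10)
    for i in range(10_000):
        # 10,000 raw points spread over one second of traffic
        aggregator.record("latency_ms", 1_700_000_000 + i / 10_000, 12.5)
    rows = aggregator.flush()
    print(f"raw points: 10000, rows written: {len(rows)}")
```

Storing the count and sum rather than a precomputed average keeps the rows mergeable, so coarser windows can be built later without revisiting raw data. Percentile-based SLIs would need a different summary, such as histogram buckets.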