
SLA, SLO, and SLI definitions in HLD - Scalability & System Analysis


Scalability Analysis - SLA, SLO, and SLI definitions

Growth Table: SLA, SLO, and SLI Definitions

Scale	What Changes?
100 users	Basic SLIs measured manually or with simple tools; SLOs set loosely; SLAs informal or simple.
10,000 users	Automated monitoring for SLIs; SLOs become formal targets; SLAs documented with penalties; alerting systems introduced.
1,000,000 users	High precision SLIs with real-time dashboards; strict SLOs with error budgets; SLAs legally binding; multi-region monitoring.
100,000,000 users	Distributed SLIs across services; dynamic SLO adjustments; SLAs include complex compliance; advanced anomaly detection.
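
At the smallest scale in the table, measuring a basic SLI with simple tools can be as plain as counting good requests in a log sample. The following is a minimal sketch under stated assumptions: the `Request` record, the function names, and the 99% target are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # server-side latency in milliseconds


def availability_sli(requests):
    """Fraction of requests that did not fail with a 5xx status."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms):
    """Fraction of requests served faster than threshold_ms."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


SLO_AVAILABILITY = 0.99  # hypothetical internal target

sample = [
    Request(200, 40.0),
    Request(200, 120.0),
    Request(503, 900.0),
    Request(200, 80.0),
]

print(availability_sli(sample))                      # 0.75
print(latency_sli(sample, 100.0))                    # 0.5
print(availability_sli(sample) >= SLO_AVAILABILITY)  # False
```

Here the SLI is the measured ratio, the SLO is the internal target it is compared against, and an SLA would be the external commitment built on top of such targets.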

First Bottleneck

At small scale, the main bottleneck is lack of precise measurement tools for SLIs, causing unclear SLOs and SLA enforcement.

As scale grows, bottlenecks shift to monitoring infrastructure capacity and data aggregation delays, making real-time SLI tracking difficult.

Scaling Solutions

Automate Monitoring: Use scalable monitoring tools (e.g., Prometheus, Datadog) to collect SLIs efficiently.
Distributed Tracing: Implement tracing to measure SLIs across microservices.
Alerting and Error Budgets: Use SLO-based alerting to focus on meaningful issues.
Data Aggregation: Use time-series databases and aggregation to handle large volumes of SLI data.
Legal and Compliance: Scale SLA management with contract automation tools.
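
SLO-based alerting with error budgets could be sketched roughly as follows. This is an illustration, not a prescribed implementation: the function names are hypothetical, and the paging threshold of 14.4 is an assumed fast-burn value (it corresponds to spending 2% of a 30-day budget in one hour) that a real system would tune.

```python
def error_budget(slo_target, total_requests):
    """Number of failed requests the SLO tolerates over the window."""
    return (1.0 - slo_target) * total_requests


def burn_rate(failed, total, slo_target):
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 means the budget is being spent exactly on schedule;
    higher values mean it will run out before the window ends.
    """
    if total == 0:
        return 0.0
    observed_error_rate = failed / total
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def should_page(failed, total, slo_target, threshold=14.4):
    """Page only when the budget is burning faster than the threshold."""
    return burn_rate(failed, total, slo_target) >= threshold


# A 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
print(round(error_budget(0.999, 1_000_000)))      # 1000
print(round(burn_rate(50, 10_000, 0.999), 2))     # 5.0
print(should_page(50, 10_000, 0.999))             # False
print(should_page(200, 10_000, 0.999))            # True
```

Alerting on burn rate rather than on every failed request is what keeps alerts focused on issues that actually threaten the SLO.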

Back-of-Envelope Cost Analysis

For 1 million users, each generating 100 requests per second (RPS), with every metric point shipped raw and un-aggregated:

The monitoring system must collect and aggregate metrics for 1 million × 100 = 100 million RPS.
Storage for SLI data: assuming 10 metrics per request, 100 million RPS × 10 metrics × 86,400 seconds/day ≈ 86 trillion data points/day.
Bandwidth for monitoring data: at ~10 KB per metric point, 86 trillion points × 10 KB ≈ 860 PB/day.
Costs grow with retention period and granularity of SLIs.
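
The arithmetic above can be checked with a few lines of Python. The constants are the figures stated in this section; the variable names are illustrative, and KB and PB are taken as decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes).

```python
USERS = 1_000_000
RPS_PER_USER = 100
METRICS_PER_REQUEST = 10
BYTES_PER_POINT = 10_000   # ~10 KB per metric point, decimal units
SECONDS_PER_DAY = 86_400

# Aggregate request rate across all users.
total_rps = USERS * RPS_PER_USER

# Raw metric points produced per day if nothing is aggregated.
points_per_day = total_rps * METRICS_PER_REQUEST * SECONDS_PER_DAY

# Daily monitoring volume in petabytes.
petabytes_per_day = points_per_day * BYTES_PER_POINT / 1e15

print(total_rps)           # 100000000
print(points_per_day)      # 86400000000000
print(petabytes_per_day)   # 864.0
```

Changing `METRICS_PER_REQUEST` or `BYTES_PER_POINT` shows how directly the daily volume depends on the granularity of the SLIs.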

Interview Tip

When discussing SLA, SLO, and SLI scalability, start by defining each term clearly.

Explain how measurement precision and monitoring infrastructure evolve with scale.

Discuss bottlenecks in data collection and aggregation, then propose automation and distributed monitoring solutions.

Use real numbers to show understanding of volume and cost impact.

Self Check

Your database handles 1000 QPS for storing SLI data. Traffic grows 10x to 10,000 QPS. What do you do first?

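Answer: Storing SLI data is a write-heavy workload, so first batch and pre-aggregate incoming points (for example, per-interval counts instead of one row per request) to cut the write rate. If writes still exceed capacity, partition the database by time or by service. Caching and read replicas help only with read traffic from dashboards and queries, not with the write load.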
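
The aggregation step in the answer can be sketched as follows. This is a minimal in-memory illustration under assumptions: the event shape `(timestamp, service, ok)`, the 10-second bucket size, and the function name are hypothetical, and a real pipeline would flush completed buckets to the database on a timer.

```python
from collections import defaultdict


def aggregate(events, bucket_seconds=10):
    """Collapse raw events into one (good, total) row per (bucket, service).

    Each event is a (timestamp_seconds, service_name, ok) tuple.
    """
    buckets = defaultdict(lambda: [0, 0])  # key -> [good, total]
    for ts, service, ok in events:
        # Round the timestamp down to the start of its bucket.
        bucket_start = int(ts) // bucket_seconds * bucket_seconds
        counts = buckets[(bucket_start, service)]
        counts[1] += 1
        if ok:
            counts[0] += 1
    return {key: tuple(counts) for key, counts in buckets.items()}


events = [
    (0.5, "api", True),
    (3.2, "api", False),
    (9.9, "api", True),
    (10.1, "api", True),
    (12.0, "auth", True),
]

rows = aggregate(events)
print(rows)
# {(0, 'api'): (2, 3), (10, 'api'): (1, 1), (10, 'auth'): (1, 1)}
print(len(events), "raw events ->", len(rows), "rows written")
```

The database then receives one write per service per bucket instead of one per request, so write QPS scales with the number of services and the bucket size rather than with traffic.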

Key Result

SLI measurement and monitoring infrastructure become the first bottleneck as user scale grows; automating and distributing monitoring is key to scaling SLA and SLO management.