
SLA, SLO, and SLI definitions in HLD - System Design Guide

Problem Statement
Without clear agreements on service reliability, users face unpredictable downtime and poor performance. This causes frustration and loss of trust because expectations are not set or measured. Teams also struggle to prioritize fixes without measurable targets.
Solution
Define measurable indicators of service health (SLIs), set target levels for these indicators (SLOs), and formalize commitments to users (SLAs). This creates clear expectations and accountability. Teams can monitor SLIs against SLOs to maintain service quality and handle breaches according to SLAs.
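The relationship between an SLI and its SLO can be sketched in a few lines. This is a minimal illustration, not a production monitoring setup: the `Slo` class, the function names, the service name `checkout-availability`, and the 99.9% target are all hypothetical, and the SLI is assumed to be a simple ratio of successful requests to total requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target level for one SLI (hypothetical structure for illustration)."""
    name: str
    target: float  # 0.999 means 99.9% of requests must succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests served successfully."""
    if total_requests == 0:
        # No traffic in the window: treated as fully available (a design choice).
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo: Slo) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo.target


# Assumed example values, not taken from any real service.
slo = Slo(name="checkout-availability", target=0.999)
sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(f"{slo.name}: SLI={sli:.4%} target={slo.target:.1%} met={meets_slo(sli, slo)}")
```

An SLA would then typically commit to a looser target than the internal SLO, so that the team is alerted before a contractual breach occurs.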
Architecture
[Diagram: Service, SLI, Monitoring, SLA]

This diagram shows how a service's performance is measured by SLIs, which are compared against SLOs. Monitoring tracks these metrics, and SLAs formalize the commitments to users based on these targets.

Trade-offs
✓ Pros
Provides clear, measurable targets for service reliability and performance.
Aligns user expectations with engineering goals to reduce disputes.
Enables proactive monitoring and faster incident response.
Helps prioritize engineering efforts based on impact to users.
✗ Cons
Requires effort to define meaningful and measurable SLIs.
Setting unrealistic SLOs can lead to frequent breaches and loss of trust.
SLAs can create legal or financial risk if not carefully managed.
Use when operating services with external users or internal teams needing reliability guarantees, especially at scale with multiple stakeholders.
Avoid for small projects or prototypes where formal reliability targets add overhead without clear benefit.
Real World Examples
Google
Defines SLIs like request latency and error rate for services like Google Cloud, sets SLOs to maintain user trust, and uses SLAs to formalize uptime guarantees.
Netflix
Uses SLIs to monitor streaming quality and availability, sets SLOs to ensure smooth playback, and uses SLAs to manage customer expectations.
Amazon AWS
Publishes SLAs with uptime guarantees for services like EC2 and S3, backed by SLOs and SLIs tracked internally to maintain service health.
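To make an uptime guarantee concrete, it helps to convert the percentage into allowed downtime. The sketch below assumes a 30-day measurement period and uses illustrative targets; the function name is hypothetical and the percentages are not the published figures of any provider named above.

```python
def allowed_downtime_minutes(uptime_target: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an uptime target over the period."""
    if not 0.0 <= uptime_target <= 1.0:
        raise ValueError("uptime_target must be a fraction between 0 and 1")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - uptime_target)


# Illustrative targets only (assumed for this example).
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} uptime allows about {minutes:.1f} minutes of downtime per 30 days")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why tighter guarantees tend to cost disproportionately more to meet.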
Alternatives
Error Budgeting
Focuses on the allowable error or downtime implied by an SLO (the gap between the target and 100%) to balance innovation and reliability.
Use when: You want to manage risk tolerance and prioritize feature releases alongside reliability.
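The arithmetic behind an error budget can be sketched as follows. This assumes a request-based SLO over a fixed window; the function names and the example numbers are hypothetical, and real policies for spending the budget vary by team.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates within the window."""
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1); a 100% target leaves no budget")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # No traffic: nothing has been spent, so report the budget as intact.
        return 1.0
    return 1.0 - failed_requests / budget


# Assumed example: a 99.9% SLO over one million requests, with 250 failures so far.
remaining = budget_remaining(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"Error budget remaining: {remaining:.0%}")
```

One common way to use this figure is to slow or pause feature releases when the remaining budget approaches zero and to resume them once it recovers.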
Service Level Management (SLM)
Broader ITIL-based process including SLA negotiation, monitoring, and reporting beyond just metrics.
Use when: You manage multiple services under formal IT service management processes.
Summary
SLIs are specific metrics that measure service health and performance.
SLOs set target levels for SLIs to define acceptable service quality.
SLAs are formal agreements with users based on SLOs to set expectations and accountability.