
SLA, SLO, and SLI definitions in HLD - System Design Exercise

Design: Service Level Agreement (SLA), Service Level Objective (SLO), and Service Level Indicator (SLI) Definitions
In scope: Definitions, relationships, examples, and monitoring basics.
Out of scope: Detailed implementation of monitoring tools or advanced reliability engineering.
Functional Requirements
FR1: Define SLA, SLO, and SLI clearly with examples
FR2: Explain how these terms relate to each other in service reliability
FR3: Show how to measure and monitor SLIs to meet SLOs and SLAs
FR4: Provide a simple example of a web service with SLA, SLO, and SLI
Non-Functional Requirements
NFR1: Definitions must be simple and jargon-free
NFR2: Examples must be relatable to common web services
NFR3: Focus on clarity and practical understanding
NFR4: No complex technical details beyond basic monitoring
Think Before You Design
Key Components
Service Level Indicator (SLI) metrics like uptime, latency, error rate
Service Level Objective (SLO) targets for these metrics
Service Level Agreement (SLA) as a formal contract based on SLOs
Monitoring and alerting tools to track SLIs
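To make an SLO target concrete, it helps to translate a percentage into the downtime it permits. The sketch below is only an illustration: the function name `allowed_downtime` and the 30-day window are assumptions, not part of the exercise. For example, a 99.9% uptime target over 30 days leaves roughly 43 minutes of allowed downtime.

```python
from datetime import timedelta


def allowed_downtime(target_percent: float, window: timedelta) -> timedelta:
    """Return how much downtime fits in the window while still meeting the target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return window * (1 - target_percent / 100)


# Assumed measurement window; real SLOs define their own.
window = timedelta(days=30)
for target in (99.0, 99.9, 99.99):
    print(f"{target}% over 30 days allows {allowed_downtime(target, window)} of downtime")
```

Each extra "nine" in the target cuts the allowed downtime by a factor of ten, which is why targets are usually chosen with the cost of meeting them in mind.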
Design Patterns
Monitoring and alerting pattern
Feedback loop for continuous improvement
Contract-based service reliability
Reference Architecture
  +-------------------+       +-------------------+       +-------------------+
  |   Web Service     | <---> |   Monitoring      | <---> |   Alerting System |
  +-------------------+       +-------------------+       +-------------------+
          |                          |                             |
          | SLIs (uptime, latency)   |                             |
          |------------------------->|                             |
                                     |                             |
                                     | Checks if SLOs met          |
                                     |---------------------------->|
                                                                   | Sends alerts if SLOs violated (SLA at risk)
Components
Web Service
Any web server (e.g., Nginx, Apache, Node.js)
Provides the service whose reliability is measured
Monitoring
Prometheus, Datadog, or simple custom metrics
Collects SLIs such as uptime, latency, and error rate
Alerting System
PagerDuty, Opsgenie, or email alerts
Notifies when SLOs are not met, indicating SLA risk
Request Flow
1. User sends request to Web Service
2. Web Service processes request and responds
3. Monitoring system collects SLIs like response time and error rate
4. Monitoring compares SLIs against defined SLO targets
5. If SLOs are violated, Alerting System sends notifications
6. The SLA is the formal agreement built on these SLOs, so repeated SLO violations signal a risk of breaching the SLA
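Steps 3 to 5 of the flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a monitoring implementation: `RequestRecord`, `SLO_TARGETS`, the threshold values, and `send_alert` are all hypothetical names and numbers, and a real setup would rely on a tool such as Prometheus or Datadog instead of in-memory lists.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_ms: float
    status_code: int


# Hypothetical SLO targets, expressed as upper limits for each SLI.
SLO_TARGETS = {"error_rate": 0.01, "p95_latency_ms": 300.0}


def compute_slis(records):
    """Step 3: derive SLIs (error rate, p95 latency) from observed requests."""
    if not records:
        raise ValueError("need at least one request to compute SLIs")
    total = len(records)
    errors = sum(1 for r in records if r.status_code >= 500)
    latencies = sorted(r.latency_ms for r in records)
    rank = math.ceil(total * 95 / 100)  # nearest-rank 95th percentile
    return {"error_rate": errors / total, "p95_latency_ms": latencies[rank - 1]}


def violated_slos(slis, targets=SLO_TARGETS):
    """Step 4: return the names of SLIs that exceed their SLO limit."""
    return [name for name, limit in targets.items() if slis[name] > limit]


def send_alert(violations):
    """Step 5: stand-in for a real notifier such as PagerDuty or email."""
    for name in violations:
        print(f"ALERT: SLO for {name} violated")


# 98 fast successful requests and 2 slow server errors.
records = [RequestRecord(100.0, 200)] * 98 + [RequestRecord(500.0, 500)] * 2
slis = compute_slis(records)
violations = violated_slos(slis)
if violations:
    send_alert(violations)
```

In this sample the error rate is 2%, above the assumed 1% limit, so only the error-rate SLO triggers an alert while the latency SLO is still met.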
Database Schema
Entities:
- Service: id, name, description
- SLI: id, service_id, metric_name (e.g., uptime, latency), measurement_method
- SLO: id, service_id, sli_id, target_value (e.g., 99.9%), time_window
- SLA: id, service_id, slo_id, penalty_terms
Relationships:
- Service 1:N SLI
- Service 1:N SLO (each linked to one SLI)
- Service 1:N SLA (each linked to one SLO)
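One possible way to express these entities as tables is sketched below using Python's built-in `sqlite3` module. The column types, the in-memory database, and the sample rows (service name, measurement method, penalty text) are illustrative assumptions; only the entity and field names come from the schema above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE service (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    description TEXT
);
CREATE TABLE sli (
    id INTEGER PRIMARY KEY,
    service_id INTEGER NOT NULL REFERENCES service(id),
    metric_name TEXT NOT NULL,
    measurement_method TEXT
);
CREATE TABLE slo (
    id INTEGER PRIMARY KEY,
    service_id INTEGER NOT NULL REFERENCES service(id),
    sli_id INTEGER NOT NULL REFERENCES sli(id),
    target_value REAL NOT NULL,
    time_window TEXT NOT NULL
);
CREATE TABLE sla (
    id INTEGER PRIMARY KEY,
    service_id INTEGER NOT NULL REFERENCES service(id),
    slo_id INTEGER NOT NULL REFERENCES slo(id),
    penalty_terms TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript(SCHEMA)

# Made-up sample rows: one service with one SLI, one SLO, and one SLA.
service_id = conn.execute(
    "INSERT INTO service (name, description) VALUES (?, ?)",
    ("web-shop", "Example web service"),
).lastrowid
sli_id = conn.execute(
    "INSERT INTO sli (service_id, metric_name, measurement_method) VALUES (?, ?, ?)",
    (service_id, "uptime", "health check every minute"),
).lastrowid
slo_id = conn.execute(
    "INSERT INTO slo (service_id, sli_id, target_value, time_window) VALUES (?, ?, ?, ?)",
    (service_id, sli_id, 99.9, "30 days"),
).lastrowid
conn.execute(
    "INSERT INTO sla (service_id, slo_id, penalty_terms) VALUES (?, ?, ?)",
    (service_id, slo_id, "service credit"),
)

# Follow the chain SLA -> SLO -> SLI -> Service.
rows = conn.execute(
    """
    SELECT service.name, sli.metric_name, slo.target_value, slo.time_window, sla.penalty_terms
    FROM sla
    JOIN slo ON slo.id = sla.slo_id
    JOIN sli ON sli.id = slo.sli_id
    JOIN service ON service.id = sla.service_id
    """
).fetchall()
print(rows)
```

The foreign keys mirror the relationships listed above: every SLO points at exactly one SLI, and every SLA points at exactly one SLO.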
Scaling Discussion
Bottlenecks
Monitoring system overwhelmed by high volume of metrics
Alerting system sending too many false positives
Difficulty in accurately measuring SLIs in distributed systems
Solutions
Use sampling and aggregation to reduce metric volume
Set proper thresholds and use anomaly detection to reduce false alerts
Implement distributed tracing and centralized logging for accurate measurement
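As an illustration of the aggregation idea, raw per-request samples can be rolled up into one summary per minute before they are stored or shipped. This is a minimal sketch: the tuple layout `(timestamp, latency_ms, is_error)` and the function name `aggregate_per_minute` are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_per_minute(samples):
    """Roll up (unix_timestamp, latency_ms, is_error) samples into per-minute summaries."""
    buckets = defaultdict(
        lambda: {"count": 0, "errors": 0, "latency_sum_ms": 0.0, "latency_max_ms": 0.0}
    )
    for timestamp, latency_ms, is_error in samples:
        minute_start = int(timestamp) // 60 * 60  # floor to the start of the minute
        bucket = buckets[minute_start]
        bucket["count"] += 1
        bucket["errors"] += 1 if is_error else 0
        bucket["latency_sum_ms"] += latency_ms
        bucket["latency_max_ms"] = max(bucket["latency_max_ms"], latency_ms)
    return dict(buckets)


samples = [(0, 120.0, False), (30, 80.0, True), (61, 200.0, False)]
summary = aggregate_per_minute(samples)
for minute_start, stats in sorted(summary.items()):
    print(minute_start, stats)
```

Error rate and mean latency can still be derived from each summary (`errors / count`, `latency_sum_ms / count`), but exact percentiles cannot, which is the usual trade-off of aggregating before storage.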
Interview Tips
Time: Spend 10 minutes defining SLA, SLO, and SLI with examples; 10 minutes explaining their relationships and monitoring; 5 minutes discussing challenges and scaling; 5 minutes for questions and clarifications.
Clear distinction: SLA is a contract, SLO is a target, SLI is a metric
SLIs must be measurable and meaningful
SLOs guide operational goals and alerting
SLAs formalize commitments and consequences
Monitoring and alerting are essential for maintaining reliability