SLA, SLO, and SLI definitions in HLD - System Design Exercise

Design: Service Level Agreement (SLA), Service Level Objective (SLO), and Service Level Indicator (SLI) Definitions

In scope: Definitions, relationships, examples, and monitoring basics.

Out of scope: Detailed implementation of monitoring tools or advanced reliability engineering.

Functional Requirements

FR1: Define SLA, SLO, and SLI clearly with examples

FR2: Explain how these terms relate to each other in service reliability

FR3: Show how to measure and monitor SLIs to meet SLOs and SLAs

FR4: Provide a simple example of a web service with SLA, SLO, and SLI

Non-Functional Requirements

NFR1: Definitions must be simple and jargon-free

NFR2: Examples must be relatable to common web services

NFR3: Focus on clarity and practical understanding

NFR4: No complex technical details beyond basic monitoring

Think Before You Design

Key Components

Service Level Indicators (SLIs): measured metrics such as uptime, latency, and error rate

Service Level Objectives (SLOs): internal targets for those metrics (e.g., 99.9% uptime)

Service Level Agreement (SLA): a formal contract with customers, based on the SLOs

Monitoring and alerting tools to track SLIs
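
To make the target concrete, the following is a minimal sketch of how an availability SLO translates into tolerated downtime. The `AvailabilitySLO` class, its fields, and the `meets_slo` helper are hypothetical names introduced for illustration; the 99.9% target over a 30-day window is an assumed example, under which the tolerated downtime works out to about 43.2 minutes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """An availability target over a time window (hypothetical model)."""
    target_percent: float  # assumed example: 99.9
    window_days: int       # assumed example: 30

    def allowed_downtime_minutes(self) -> float:
        """Minutes of downtime the target tolerates within the window."""
        window_minutes = self.window_days * 24 * 60
        return window_minutes * (100.0 - self.target_percent) / 100.0


def meets_slo(measured_uptime_percent: float, slo: AvailabilitySLO) -> bool:
    """The SLI (measured uptime) meets the SLO when it reaches the target."""
    return measured_uptime_percent >= slo.target_percent


slo = AvailabilitySLO(target_percent=99.9, window_days=30)
print(round(slo.allowed_downtime_minutes(), 1))  # tolerated downtime in minutes
print(meets_slo(99.95, slo))  # SLI above the target
print(meets_slo(99.80, slo))  # SLI below the target
```

Here the measured uptime plays the role of the SLI and the target plays the role of the SLO. An SLA would attach consequences, such as service credits, to missing a committed level.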

Design Patterns

Monitoring and alerting pattern

Feedback loop for continuous improvement

Contract-based service reliability

Reference Architecture

  +-------------------+       +-------------------+       +-------------------+
  |   Web Service     | <---> |   Monitoring      | <---> |  Alerting System  |
  +-------------------+       +-------------------+       +-------------------+
          |                          |                             |
          | SLIs (uptime, latency)   |                             |
          |------------------------->|                             |
                                     |                             |
                                     | Checks if SLOs met          |
                                     |---------------------------->|
                                                                   | Sends alerts if SLOs violated (SLA at risk)

Components

Web Service

Any web server (e.g., Nginx, Apache, Node.js)

Provides the service whose reliability is measured

Monitoring

Prometheus, Datadog, or simple custom metrics

Collects SLIs such as uptime, latency, and error rate

Alerting System

PagerDuty, Opsgenie, or email alerts

Notifies when SLOs are not met, indicating SLA risk

Request Flow

1. User sends request to Web Service

2. Web Service processes request and responds

3. Monitoring system collects SLIs like response time and error rate

4. Monitoring compares SLIs against defined SLO targets

5. If SLOs are violated, Alerting System sends notifications

6. SLA defines the formal agreement based on these SLOs, so repeated SLO violations signal a risk of breaching the SLA
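
Steps 3 to 5 of the flow can be sketched as follows. This is an illustration under stated assumptions, not a production monitor: `RequestRecord`, `error_rate`, `fraction_within_latency`, and `check_slos` are hypothetical names, HTTP 5xx responses are assumed to count as errors, and the thresholds (0.1% error rate, 95% of requests within 300 ms) are example values only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestRecord:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    status_code: int


def error_rate(records: List[RequestRecord]) -> float:
    """SLI: fraction of requests that failed with a 5xx status."""
    if not records:
        return 0.0
    errors = sum(1 for r in records if r.status_code >= 500)
    return errors / len(records)


def fraction_within_latency(records: List[RequestRecord], threshold_ms: float) -> float:
    """SLI: fraction of requests answered within the latency threshold."""
    if not records:
        return 1.0
    fast = sum(1 for r in records if r.latency_ms <= threshold_ms)
    return fast / len(records)


def check_slos(
    records: List[RequestRecord],
    max_error_rate: float = 0.001,        # assumed SLO target
    latency_threshold_ms: float = 300.0,  # assumed SLO threshold
    min_fast_fraction: float = 0.95,      # assumed SLO target
) -> List[str]:
    """Compare SLIs against SLO targets and return alert messages."""
    alerts = []
    measured_errors = error_rate(records)
    if measured_errors > max_error_rate:
        alerts.append(
            f"error rate {measured_errors:.2%} exceeds SLO {max_error_rate:.2%}"
        )
    measured_fast = fraction_within_latency(records, latency_threshold_ms)
    if measured_fast < min_fast_fraction:
        alerts.append(
            f"only {measured_fast:.2%} of requests within "
            f"{latency_threshold_ms:.0f} ms, SLO is {min_fast_fraction:.2%}"
        )
    return alerts


# 97 fast successes, 2 slow successes, 1 server error
sample = (
    [RequestRecord(120.0, 200)] * 97
    + [RequestRecord(900.0, 200)] * 2
    + [RequestRecord(50.0, 500)]
)
for message in check_slos(sample):
    print("ALERT:", message)
```

In a real deployment the alert messages would be handed to the alerting system rather than printed.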

Database Schema

Entities:
- Service: id, name, description
- SLI: id, service_id, metric_name (e.g., uptime, latency), measurement_method
- SLO: id, service_id, sli_id, target_value (e.g., 99.9%), time_window
- SLA: id, service_id, slo_id, penalty_terms

Relationships:
- Service 1:N SLI
- Service 1:N SLO (each linked to one SLI)
- Service 1:N SLA (each linked to one SLO)
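
One way to try out this schema is with Python's built-in `sqlite3` module and an in-memory database. The column types, the `NOT NULL` constraints, and the sample rows (a service named `checkout-api`, a `30d` window, a `service credit` penalty) are assumptions added for illustration; only the entity and column names come from the schema above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE service (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    description TEXT
);
CREATE TABLE sli (
    id INTEGER PRIMARY KEY,
    service_id INTEGER NOT NULL REFERENCES service(id),
    metric_name TEXT NOT NULL,
    measurement_method TEXT
);
CREATE TABLE slo (
    id INTEGER PRIMARY KEY,
    service_id INTEGER NOT NULL REFERENCES service(id),
    sli_id INTEGER NOT NULL REFERENCES sli(id),
    target_value REAL NOT NULL,
    time_window TEXT NOT NULL
);
CREATE TABLE sla (
    id INTEGER PRIMARY KEY,
    service_id INTEGER NOT NULL REFERENCES service(id),
    slo_id INTEGER NOT NULL REFERENCES slo(id),
    penalty_terms TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the 1:N links
conn.executescript(SCHEMA)

# Hypothetical sample data: one service with one SLI, SLO, and SLA.
conn.execute(
    "INSERT INTO service (id, name, description) VALUES (?, ?, ?)",
    (1, "checkout-api", "Example web service"),
)
conn.execute(
    "INSERT INTO sli (id, service_id, metric_name, measurement_method) "
    "VALUES (?, ?, ?, ?)",
    (1, 1, "uptime", "synthetic health check"),
)
conn.execute(
    "INSERT INTO slo (id, service_id, sli_id, target_value, time_window) "
    "VALUES (?, ?, ?, ?, ?)",
    (1, 1, 1, 99.9, "30d"),
)
conn.execute(
    "INSERT INTO sla (id, service_id, slo_id, penalty_terms) VALUES (?, ?, ?, ?)",
    (1, 1, 1, "service credit"),
)

# Follow the chain SLA -> SLO -> SLI -> Service.
rows = conn.execute(
    """
    SELECT service.name, sli.metric_name, slo.target_value,
           slo.time_window, sla.penalty_terms
    FROM sla
    JOIN slo ON slo.id = sla.slo_id
    JOIN sli ON sli.id = slo.sli_id
    JOIN service ON service.id = sla.service_id
    """
).fetchall()
print(rows)
```

The join mirrors the relationships in the schema: each SLA points at one SLO, which in turn points at the SLI it constrains.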

Scaling Discussion

Bottlenecks

Monitoring system overwhelmed by high volume of metrics

Alerting system sending too many false positives

Difficulty in accurately measuring SLIs in distributed systems

Solutions

Use sampling and aggregation to reduce metric volume

Set proper thresholds and use anomaly detection to reduce false alerts

Implement distributed tracing and centralized logging for accurate measurement
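
As a rough illustration of the aggregation idea, raw per-request samples can be rolled up into one summary per minute before they are stored or shipped. The function name `aggregate_per_minute`, the sample tuple shape, and the choice of summary fields are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (timestamp in seconds, latency in ms, whether the request failed)
Sample = Tuple[float, float, bool]


def aggregate_per_minute(samples: Iterable[Sample]) -> Dict[int, dict]:
    """Roll raw samples up into one summary per one-minute bucket."""
    buckets = defaultdict(
        lambda: {"count": 0, "errors": 0, "latency_sum_ms": 0.0, "latency_max_ms": 0.0}
    )
    for timestamp, latency_ms, is_error in samples:
        minute_start = int(timestamp // 60) * 60  # bucket key: start of the minute
        bucket = buckets[minute_start]
        bucket["count"] += 1
        bucket["errors"] += 1 if is_error else 0
        bucket["latency_sum_ms"] += latency_ms
        bucket["latency_max_ms"] = max(bucket["latency_max_ms"], latency_ms)
    return dict(buckets)


raw = [(0.0, 100.0, False), (30.0, 200.0, True), (61.0, 50.0, False)]
summary = aggregate_per_minute(raw)
for minute_start, stats in sorted(summary.items()):
    print(minute_start, stats)
```

The trade-off is that sums and maxima cannot reproduce exact percentiles afterwards, so a latency SLI defined as a percentile would need bucketed histograms or similar instead.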

Interview Tips

Time: Spend 10 minutes defining SLA, SLO, and SLI with examples; 10 minutes explaining their relationships and monitoring; 5 minutes discussing challenges and scaling; 5 minutes for questions and clarifications.

Clear distinction: SLA is a contract, SLO is a target, SLI is a metric

SLIs must be measurable and meaningful

SLOs guide operational goals and alerting

SLAs formalize commitments and consequences

Monitoring and alerting are essential for maintaining reliability