
Why observability is critical in distributed systems in Microservices - Design It to Understand It

Design: Observability in Distributed Systems
Focus on observability components and their integration with microservices. Exclude detailed implementation of microservices themselves.
Functional Requirements
FR1: Track and monitor system health across multiple microservices
FR2: Detect and diagnose failures quickly
FR3: Understand system behavior and performance under load
FR4: Provide actionable insights for debugging and optimization
FR5: Support real-time alerting for critical issues
Non-Functional Requirements
NFR1: Handle data from hundreds of microservices
NFR2: Low latency for alerting (p99 < 1s)
NFR3: High availability (99.9% uptime) for observability tools
NFR4: Minimal performance impact on production services
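To make NFR3 concrete, the availability target can be converted into a downtime budget. The sketch below is only arithmetic; the function name `downtime_budget_minutes` is an illustrative choice, not part of any tool named in this design.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int) -> float:
    """Minutes of allowed downtime for an availability target over a period."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * 24 * 60


# Print the budget implied by the 99.9% target for two common periods.
for label, days in (("30-day month", 30), ("365-day year", 365)):
    print(f"{label}: {downtime_budget_minutes(99.9, days):.1f} minutes")
```

At 99.9%, the observability tools may be unavailable for about 43.2 minutes in a 30-day month, or about 525.6 minutes (roughly 8.8 hours) in a year.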
Key Components
Instrumentation libraries in microservices
Centralized logging system
Metrics collection and storage
Distributed tracing system
Alerting and dashboarding tools
Design Patterns
Correlation IDs for tracing requests
Push vs pull metrics collection
Sampling strategies for traces
Event-driven alerting
Data aggregation and retention policies
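The first pattern, correlation IDs, can be sketched with the Python standard library alone. This is a minimal illustration, not how any particular framework does it: the header name `X-Correlation-ID`, the logger name `orders`, and the helpers `begin_request`, `outgoing_headers`, and `handle_request` are all assumptions made for the example.

```python
import contextvars
import logging
import uuid

HEADER = "X-Correlation-ID"  # assumed header name, chosen for illustration

# Holds the correlation ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


logger = logging.getLogger("orders")  # hypothetical service name
handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(
    logging.Formatter("%(asctime)s %(correlation_id)s %(message)s")
)
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def begin_request(headers):
    """Reuse the caller's ID if one was sent, otherwise mint a new one."""
    cid = headers.get(HEADER) or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def outgoing_headers():
    """Headers to attach to downstream calls so the ID follows the request."""
    return {HEADER: correlation_id.get()}


def handle_request(headers):
    """Hypothetical handler: log with the ID and pass it downstream."""
    begin_request(headers)
    logger.info("processing order")
    return outgoing_headers()
```

Because every service logs the same ID and forwards it on outbound calls, a single search for that ID in the log store returns the whole request path.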
Reference Architecture
  +----------------+       +----------------+       +----------------+
  | Microservices  |-----> | Observability  |-----> | Alerting &     |
  | (Instrumented) |       | Data Pipeline  |       | Dashboarding   |
  +----------------+       +----------------+       +----------------+
          |                        |                        |
          |                        |                        |
          v                        v                        v
  +----------------+       +----------------+       +----------------+
  | Logs Storage   |       | Metrics Store  |       | Tracing Store  |
  +----------------+       +----------------+       +----------------+
Components
Microservices Instrumentation (OpenTelemetry SDKs): Collect logs, metrics, and traces from each microservice.
Observability Data Pipeline (Kafka or similar message queue): Transport telemetry data reliably to storage systems.
Logs Storage (Elasticsearch or Loki): Store and index logs for search and analysis.
Metrics Store (Prometheus or TimescaleDB): Store time-series metrics for monitoring and alerting.
Tracing Store (Jaeger or Zipkin): Store distributed traces to visualize request flows.
Alerting & Dashboarding (Grafana, Alertmanager): Visualize data and send alerts on anomalies.
Request Flow
1. Microservices generate telemetry data (logs, metrics, traces) with instrumentation.
2. Data is sent asynchronously to the Observability Data Pipeline (e.g., Kafka).
3. Pipeline routes data to appropriate storage: logs to Logs Storage, metrics to Metrics Store, traces to Tracing Store.
4. Alerting system queries metrics and logs to detect issues based on defined rules.
5. Dashboards visualize real-time system health and performance.
6. When alerts trigger, notifications are sent to engineers for quick response.
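The flow above can be sketched in memory to show how the pieces connect. Everything here is a stand-in under stated assumptions: a bounded `queue.Queue` plays the role of the pipeline, plain lists play the role of the three stores, and the class `TelemetryPipeline`, the record fields, and the `latency_ms` threshold rule are hypothetical names for illustration.

```python
import queue
from collections import defaultdict


class TelemetryPipeline:
    """In-memory stand-in for the pipeline and the three stores."""

    def __init__(self, buffer_size=1000):
        self.buffer = queue.Queue(maxsize=buffer_size)
        self.stores = {"log": [], "metric": [], "trace": []}
        self.dropped = 0

    def emit(self, record):
        """Step 2: enqueue without blocking the service; drop if full."""
        try:
            self.buffer.put_nowait(record)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def route_all(self):
        """Step 3: drain the buffer and route each record by its type."""
        while True:
            try:
                record = self.buffer.get_nowait()
            except queue.Empty:
                return
            store = self.stores.get(record.get("type"))
            if store is None:
                self.dropped += 1  # unknown telemetry type
            else:
                store.append(record)

    def evaluate_alerts(self, threshold_ms=500.0):
        """Step 4: return services whose mean latency exceeds the threshold."""
        by_service = defaultdict(list)
        for metric in self.stores["metric"]:
            if metric["name"] == "latency_ms":
                by_service[metric["service"]].append(metric["value"])
        return sorted(
            service
            for service, values in by_service.items()
            if sum(values) / len(values) > threshold_ms
        )
```

Dropping records when the buffer is full is one possible trade-off: it protects the service's request latency at the cost of losing some telemetry under load.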
Database Schema
Entities:
Microservice (id, name)
TelemetryData (id, type [log, metric, trace], timestamp, service_id, content)
Alert (id, severity, timestamp, service_id, description)
Relationships:
Microservice 1:N TelemetryData
Microservice 1:N Alert
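As a sketch of how these entities might look in code, the dataclasses below mirror the listed fields. The Python types (for example `int` IDs and a string `severity`) and the helper `alerts_for` are assumptions; the schema itself does not specify them.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class TelemetryType(Enum):
    LOG = "log"
    METRIC = "metric"
    TRACE = "trace"


@dataclass(frozen=True)
class Microservice:
    id: int
    name: str


@dataclass(frozen=True)
class TelemetryData:
    id: int
    type: TelemetryType
    timestamp: datetime
    service_id: int  # references Microservice.id (1:N)
    content: str


@dataclass(frozen=True)
class Alert:
    id: int
    severity: str
    timestamp: datetime
    service_id: int  # references Microservice.id (1:N)
    description: str


def alerts_for(service, alerts):
    """Follow the Microservice 1:N Alert relationship."""
    return [alert for alert in alerts if alert.service_id == service.id]
```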
Scaling Discussion
Bottlenecks
High volume of telemetry data causing storage overload
Latency in processing and alerting on data
Difficulty correlating data across many services
Performance impact on microservices due to instrumentation
Solutions
Implement sampling and aggregation to reduce data volume
Use scalable storage solutions with partitioning and indexing
Adopt correlation IDs and standardized tracing formats
Use asynchronous, non-blocking instrumentation libraries
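The sampling solution can be illustrated with a deterministic, hash-based decision on the trace ID. This is one possible approach, shown under assumptions: the function name `keep_trace` and the choice of SHA-256 are illustrative, and real tracing libraries ship their own samplers.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide whether to keep a trace, given a rate between 0.0 and 1.0.

    The decision depends only on the trace ID, so every service that sees
    the same trace reaches the same answer without coordinating.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)
```

A rate of 0.1 keeps roughly one trace in ten, which cuts trace storage by about 90% while each kept trace stays complete across services.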
Interview Tips
Time (45 minutes total): Spend 10 minutes understanding requirements and constraints, 20 minutes designing the architecture and data flow, 10 minutes discussing scaling and trade-offs, and 5 minutes summarizing key points.
Explain why observability is essential for debugging and reliability in distributed systems.
Describe the types of telemetry data and how they complement each other.
Show how data flows from microservices to storage and alerting.
Discuss trade-offs in data volume, latency, and instrumentation overhead.
Highlight scaling challenges and practical solutions.