Three pillars (metrics, logs, traces) in Microservices - System Design Exercise

Design: Observability System for Microservices

This design covers the observability platform components that collect, store, and visualize metrics, logs, and traces. It excludes the implementation of the microservices themselves and the authoring of alerting rules.

Functional Requirements

FR1: Collect and store metrics from all microservices to monitor performance and resource usage

FR2: Collect and store logs from microservices for debugging and auditing

FR3: Collect and store distributed traces to understand request flows across services

FR4: Provide real-time dashboards and alerting based on metrics

FR5: Allow querying and searching logs efficiently

FR6: Visualize traces to identify latency bottlenecks

FR7: Support at least 1,000 microservices generating data concurrently

FR8: Retain metrics and logs for 30 days and traces for 7 days

Non-Functional Requirements

NFR1: System must ingest 1 million metric data points per second

NFR2: System must ingest up to 500,000 log entries per second

NFR3: Trace data must reach storage within 5 seconds of generation at p99

NFR4: System availability target of 99.9% uptime

NFR5: Data storage must be cost-effective and scalable

NFR6: APIs for querying must respond within 2 seconds for common queries
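The ingestion rates above, combined with the retention periods in FR8, allow a rough back-of-envelope storage estimate. The sketch below assumes the peak rates are sustained, and the bytes-per-item figures are assumptions chosen only for illustration; real sizes depend on encoding, compression, and indexing overhead.

```python
# Back-of-envelope storage estimate from the stated ingestion rates.
# The BYTES_PER_* values are illustrative assumptions, not measurements.
SECONDS_PER_DAY = 86_400

METRIC_POINTS_PER_SEC = 1_000_000   # NFR1
LOG_ENTRIES_PER_SEC = 500_000       # NFR2
RETENTION_DAYS = 30                 # FR8 (metrics and logs)

BYTES_PER_METRIC_POINT = 2          # assumed: compressed time-series sample
BYTES_PER_LOG_ENTRY = 200           # assumed: average stored log entry


def retained_items(rate_per_sec: int, days: int) -> int:
    """Items held in storage at steady state if the rate is sustained."""
    return rate_per_sec * SECONDS_PER_DAY * days


def to_terabytes(num_bytes: int) -> float:
    """Convert bytes to decimal terabytes."""
    return num_bytes / 1e12


metric_points = retained_items(METRIC_POINTS_PER_SEC, RETENTION_DAYS)
log_entries = retained_items(LOG_ENTRIES_PER_SEC, RETENTION_DAYS)

metrics_tb = to_terabytes(metric_points * BYTES_PER_METRIC_POINT)
logs_tb = to_terabytes(log_entries * BYTES_PER_LOG_ENTRY)

print(f"metric points retained: {metric_points:,}")
print(f"log entries retained:   {log_entries:,}")
print(f"metrics: ~{metrics_tb:.1f} TB, logs: ~{logs_tb:.1f} TB")
```

Under these assumed sizes, logs dominate storage by a wide margin, which is one reason log retention and indexing choices tend to drive cost.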

Think Before You Design

Key Components

Metrics collector and time-series database

Log aggregator and searchable log store

Distributed tracing collector and storage

Data ingestion pipelines

Visualization dashboards

Alerting and notification system

Authentication and authorization

Design Patterns

Push vs pull metrics collection

Log aggregation with centralized logging

Distributed tracing with context propagation

Data partitioning and sharding

Indexing for fast log search

Sampling and data retention strategies
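As an illustration of the sampling pattern, here is a minimal sketch of deterministic, head-based trace sampling. The function name `should_sample` and the 10% rate are assumptions for the example. Hashing the trace ID, rather than drawing a random number in each service, lets every service reach the same keep-or-drop decision for a given trace.

```python
import hashlib


def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all traces."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the trace if its bucket falls below the rate's share of the range.
    return bucket < int(sample_rate * 2**64)


# Hypothetical trace IDs, used only to show the approximate keep ratio.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

A fixed rate like this is the simplest option; it does not guarantee that rare error traces are kept, which is a trade-off worth raising when discussing sampling.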

Reference Architecture

                         +-----------------------+
                         |     Microservices     |
                         |    (Metrics, Logs,    |
                         |    Traces Emitters)   |
                         +-----------+-----------+
                                     |
          +--------------------------+--------------------------+
          |                          |                          |
+---------v---------+      +---------v---------+      +---------v---------+
| Metrics Collector |      | Log Collector     |      | Tracing Collector |
| (Prometheus Push) |      | (Fluentd/Logstash)|      | (Jaeger Collector)|
+---------+---------+      +---------+---------+      +---------+---------+
          |                          |                          |
+---------v---------+      +---------v---------+      +---------v---------+
| Time-Series DB    |      | Log Storage       |      | Trace Storage     |
| (e.g., Prometheus |      | (Elasticsearch)   |      | (Cassandra or     |
|  or Cortex)       |      |                   |      |  Elasticsearch)   |
+---------+---------+      +---------+---------+      +---------+---------+
          |                          |                          |
          +--------------------------+--------------------------+
                                     |
                           +---------v---------+
                           | Visualization     |
                           | Dashboards        |
                           | (Grafana)         |
                           +---------+---------+
                                     |
                          +----------v----------+
                          | Alerting &          |
                          | Notification System |
                          +---------------------+

Components

Metrics Collector

Prometheus Pushgateway or Prometheus exporters

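Collects metrics from microservices by pull or push: Prometheus scrapes exporter endpoints (pull), while the Pushgateway accepts pushed metrics, typically from short-lived jobs that cannot be scraped.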
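To make the pull model concrete, the sketch below shows what a service might expose on its metrics endpoint, rendered in the Prometheus text exposition format. The `Counter` class here is a hypothetical stand-in written for illustration; a real service would normally use an official Prometheus client library instead, and label-value escaping is omitted.

```python
from collections import defaultdict


class Counter:
    """Minimal in-memory counter keyed by label set (illustrative only)."""

    def __init__(self, name: str, help_text: str):
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        # Sort labels so the same label set always maps to the same key.
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def render(self) -> str:
        """Render the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            # Note: escaping of special characters in label values is omitted.
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            series = f"{self.name}{{{label_str}}}" if key else self.name
            lines.append(f"{series} {value:g}")
        return "\n".join(lines) + "\n"


requests_total = Counter("http_requests_total", "Total HTTP requests handled.")
requests_total.inc(service="checkout", status="200")
requests_total.inc(service="checkout", status="200")
requests_total.inc(service="checkout", status="500")
print(requests_total.render(), end="")
```

Each distinct label combination becomes its own time series, so high-cardinality labels (user IDs, request IDs) are best kept out of metrics.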

Time-Series Database

Prometheus or Cortex

Stores and indexes metrics data optimized for time-series queries.

Log Collector

Fluentd or Logstash

Aggregates logs from microservices, parses and forwards them.

Log Storage

Elasticsearch

Stores logs with indexing for fast search and retrieval.
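Indexing works best when services emit structured logs rather than free text. The following is a minimal sketch using Python's standard `logging` module to write one JSON object per line. The field names mirror the LogEntry entity in the Database Schema section; the `JsonFormatter` class and the example IDs are assumptions for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            # Fields passed via `extra=` become attributes on the record.
            "service_id": getattr(record, "service_id", "unknown"),
            "log_level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("checkout")
logger.info(
    "payment authorized",
    extra={
        "service_id": "checkout",
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # example value
        "span_id": "00f067aa0ba902b7",                   # example value
    },
)
```

A collector such as Fluentd or Logstash can then forward these lines without custom parsing, and the `trace_id` field allows a log entry to be matched to its trace.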

Tracing Collector

Jaeger Collector or OpenTelemetry Collector

Receives distributed trace data from microservices.

Trace Storage

Cassandra or Elasticsearch

Stores trace spans and supports trace queries and visualization.

Visualization Dashboards

Grafana

Provides real-time dashboards for metrics, logs, and traces.

Alerting & Notification System

Prometheus Alertmanager or a custom system

Generates alerts based on metric thresholds and notifies teams.
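A threshold alert is usually more useful when it requires the breach to persist, so that a single spike does not page anyone. The sketch below shows that idea with a hypothetical `ThresholdRule`; the metric name and values are invented for the example. It is similar in spirit to the `for` clause in Prometheus alerting rules, but it is not how Prometheus itself is implemented.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """Hypothetical rule: fire when a metric stays above a threshold."""
    metric_name: str
    threshold: float
    for_evaluations: int  # consecutive breaching evaluations required


def evaluate(rule: ThresholdRule, samples: list[float]) -> bool:
    """Return True if the most recent `for_evaluations` samples all breach."""
    if rule.for_evaluations < 1:
        raise ValueError("for_evaluations must be at least 1")
    if len(samples) < rule.for_evaluations:
        return False
    recent = samples[-rule.for_evaluations:]
    return all(value > rule.threshold for value in recent)


rule = ThresholdRule("http_error_rate", threshold=0.05, for_evaluations=3)
print(evaluate(rule, [0.01, 0.09, 0.02, 0.06]))  # isolated spikes: False
print(evaluate(rule, [0.01, 0.06, 0.07, 0.08]))  # sustained breach: True
```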

Request Flow

1. Microservices emit metrics, logs, and traces continuously.

2. Metrics Collector scrapes or receives pushed metrics data.

3. Metrics data is stored in the Time-Series Database.

4. Log Collector aggregates logs from microservices and forwards them.

5. Logs are indexed and stored in the Log Storage system.

6. Tracing Collector receives trace spans with context propagation.

7. Trace data is stored in Trace Storage for querying and visualization.

8. Visualization Dashboards query metrics, logs, and traces for display.

9. Alerting System monitors metrics and triggers alerts when thresholds are breached.
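Step 6 depends on context propagation: each service must pass the trace ID to the services it calls, or the spans cannot be joined into one trace. The sketch below shows this with the W3C Trace Context `traceparent` header (`00-<trace-id>-<parent-span-id>-<flags>`). The helper names `inject` and `extract` are assumptions for illustration, and some validation from the specification (for example, rejecting all-zero IDs) is left out.

```python
import re
import secrets

# Version 00: 32 hex chars of trace ID, 16 of span ID, 2 of flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)   # 8 random bytes -> 16 hex characters


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Return (trace_id, parent_span_id, sampled), or None if absent or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return trace_id, parent_span_id, bool(int(flags, 16) & 0x01)


# Service A starts a trace and calls service B.
trace_id, span_a = new_trace_id(), new_span_id()
outgoing_headers = {}
inject(outgoing_headers, trace_id, span_a)
print(outgoing_headers["traceparent"])

# Service B continues the same trace; its new span's parent is span A.
b_trace_id, b_parent_span_id, b_sampled = extract(outgoing_headers)
span_b = new_span_id()
print(f"span {span_b} continues trace {b_trace_id} under parent {b_parent_span_id}")
```

If any service in the chain drops the header, `extract` returns `None` downstream and later spans start a new, disconnected trace, which is the usual cause of missing or broken traces.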

Database Schema

Entities:

- Metric: {timestamp, service_id, metric_name, value, labels}

- LogEntry: {timestamp, service_id, log_level, message, trace_id, span_id, metadata}

- TraceSpan: {trace_id, span_id, parent_span_id, service_id, operation_name, start_time, duration, tags}

Relationships:

- Logs and TraceSpans link via trace_id for correlation.

- Metrics are tagged by service_id and labels for filtering.
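The entities above can be sketched as plain data classes to show how logs and spans are correlated through `trace_id`. The field types (float timestamps in seconds, dict labels) and the helper `logs_for_trace` are assumptions for illustration; the schema itself does not specify them.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Metric:
    timestamp: float
    service_id: str
    metric_name: str
    value: float
    labels: dict = field(default_factory=dict)


@dataclass
class LogEntry:
    timestamp: float
    service_id: str
    log_level: str
    message: str
    trace_id: Optional[str] = None
    span_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)


@dataclass
class TraceSpan:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    service_id: str
    operation_name: str
    start_time: float
    duration: float
    tags: dict = field(default_factory=dict)


def logs_for_trace(logs, trace_id):
    """Return the log entries belonging to one trace, oldest first."""
    matching = [entry for entry in logs if entry.trace_id == trace_id]
    return sorted(matching, key=lambda entry: entry.timestamp)


# Invented sample data: one trace (t1) crossing two services.
spans = [
    TraceSpan("t1", "s1", None, "gateway", "GET /checkout", 100.00, 0.250),
    TraceSpan("t1", "s2", "s1", "payments", "authorize", 100.05, 0.120),
]
logs = [
    LogEntry(100.06, "payments", "INFO", "authorizing card", "t1", "s2"),
    LogEntry(100.01, "gateway", "INFO", "request received", "t1", "s1"),
    LogEntry(101.00, "gateway", "INFO", "unrelated request", "t2", "s9"),
]
for entry in logs_for_trace(logs, "t1"):
    print(entry.timestamp, entry.service_id, entry.message)
```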

Scaling Discussion

Bottlenecks

High ingestion rate causing overload on collectors and storage

Storage size growth leading to increased query latency

Trace data volume causing slow trace retrieval

Dashboard query load impacting system responsiveness

Solutions

Use horizontal scaling and sharding for collectors and storage clusters

Implement data downsampling and aggregation for older metrics

Apply sampling strategies for traces to reduce volume

Use caching layers and query optimization for dashboards
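Downsampling older metrics can be sketched as averaging raw points into fixed-width time buckets. The function name `downsample` and the 60-second bucket are assumptions for the example, and the data is invented.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of the bucket that contains it.
        bucket_start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Six raw points at 15-second resolution become two 60-second points.
raw = [(0, 10.0), (15, 20.0), (30, 30.0), (45, 40.0), (60, 50.0), (75, 70.0)]
print(downsample(raw, 60))  # [(0, 25.0), (60, 60.0)]
```

Averaging alone hides short peaks, so a downsampled series is often stored with the minimum, maximum, and count per bucket as well.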

Interview Tips

Time: Spend 10 minutes clarifying requirements and constraints, 20 minutes designing the architecture and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.

Explain the importance of the three pillars for observability

Discuss data collection methods and protocols

Describe storage choices optimized for each data type

Highlight how data flows from microservices to visualization

Address scaling challenges and mitigation strategies

Mention security and data retention considerations

Practice

1. Which of the following best describes the role of metrics in microservices monitoring?

easy

A. They track the path of a request through multiple services.

B. They record detailed events and errors in the system.

C. They provide numerical data about system performance over time.

D. They store configuration settings for microservices.
