Three pillars (metrics, logs, traces) in Microservices - System Design Exercise

Design: Observability System for Microservices
Scope: the observability platform components that collect, store, and visualize metrics, logs, and traces. Out of scope: implementing the microservices themselves and authoring alerting rules.
Functional Requirements
FR1: Collect and store metrics from all microservices to monitor performance and resource usage
FR2: Collect and store logs from microservices for debugging and auditing
FR3: Collect and store distributed traces to understand request flows across services
FR4: Provide real-time dashboards and alerting based on metrics
FR5: Allow querying and searching logs efficiently
FR6: Visualize traces to identify latency bottlenecks
FR7: Support at least 1000 microservices generating data concurrently
FR8: Ensure data retention for 30 days for metrics and logs, 7 days for traces
Non-Functional Requirements
NFR1: System must handle ingestion of 1 million metrics data points per second
NFR2: Logs ingestion rate up to 500,000 log entries per second
NFR3: Trace data must reach storage within 5 seconds of generation at p99
NFR4: System availability target: 99.9% uptime
NFR5: Data storage must be cost-effective and scalable
NFR6: APIs for querying must respond within 2 seconds for common queries
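The ingestion rates and retention periods above imply a rough storage budget. The following back-of-the-envelope sketch uses the rates from NFR1 and NFR2 and the 30-day retention from FR8; the per-record byte sizes are assumptions chosen for illustration and do not come from the requirements.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12


def retained_bytes(events_per_second: int, bytes_per_event: int, retention_days: int) -> int:
    """Raw bytes held at steady state, before replication or index overhead."""
    return events_per_second * bytes_per_event * SECONDS_PER_DAY * retention_days


# Assumed sizes (not from the requirements): tune these to real measurements.
ASSUMED_BYTES_PER_METRIC_SAMPLE = 2   # assumption: compressed time-series sample
ASSUMED_BYTES_PER_LOG_ENTRY = 200     # assumption: average log entry size

metrics_bytes = retained_bytes(1_000_000, ASSUMED_BYTES_PER_METRIC_SAMPLE, 30)
logs_bytes = retained_bytes(500_000, ASSUMED_BYTES_PER_LOG_ENTRY, 30)

print(f"metrics: {metrics_bytes / TB:.1f} TB, logs: {logs_bytes / TB:.1f} TB")
```

Under these assumed sizes, logs dominate the budget (roughly 259 TB against roughly 5 TB for metrics), which is why log retention and tiering usually get the most attention in the cost discussion.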
Think Before You Design
Key Components
Metrics collector and time-series database
Log aggregator and searchable log store
Distributed tracing collector and storage
Data ingestion pipelines
Visualization dashboards
Alerting and notification system
Authentication and authorization
Design Patterns
Push vs pull metrics collection
Log aggregation with centralized logging
Distributed tracing with context propagation
Data partitioning and sharding
Indexing for fast log search
Sampling and data retention strategies
Reference Architecture
                    +---------------------+
                    |   Microservices      |
                    |  (Metrics, Logs,     |
                    |   Traces Emitters)   |
                    +----------+----------+
                               |
          +--------------------+--------------------+
          |                                         |
+---------v---------+                     +---------v---------+
| Metrics Collector  |                     | Log Collector     |
| (Prometheus Push)  |                     | (Fluentd/Logstash)|
+---------+---------+                     +---------+---------+
          |                                         |
+---------v---------+                     +---------v---------+
| Time-Series DB    |                     | Log Storage       |
| (e.g., Prometheus |                     | (Elasticsearch)   |
|  or Cortex)       |                     +---------+---------+
+---------+---------+                               |
          |                                         |
          |                                         |
+---------v---------+                     +---------v---------+
| Tracing Collector |                     | Visualization     |
| (Jaeger Collector)|                     | Dashboards        |
+---------+---------+                     | (Grafana)         |
          |                               +---------+---------+
+---------v---------+                               |
| Trace Storage     |                               |
| (Cassandra or     |                               |
|  Elasticsearch)   |                               |
+-------------------+                               |
                                                    |
                                         +----------v----------+
                                         | Alerting &          |
                                         | Notification System |
                                         +---------------------+
Components
Metrics Collector
Prometheus Pushgateway or Prometheus exporters
Collects metrics data from microservices using pull or push methods.
Time-Series Database
Prometheus or Cortex
Stores and indexes metrics data optimized for time-series queries.
Log Collector
Fluentd or Logstash
Aggregates logs from microservices, parses and forwards them.
Log Storage
Elasticsearch
Stores logs with indexing for fast search and retrieval.
Tracing Collector
Jaeger Collector or OpenTelemetry Collector
Receives distributed trace data from microservices.
Trace Storage
Cassandra or Elasticsearch
Stores trace spans and supports trace queries and visualization.
Visualization Dashboards
Grafana
Provides real-time dashboards for metrics, logs, and traces.
Alerting & Notification System
Prometheus Alertmanager or custom
Generates alerts based on metrics thresholds and notifies teams.
Request Flow
1. Microservices emit metrics, logs, and traces continuously.
2. Metrics Collector scrapes or receives pushed metrics data.
3. Metrics data is stored in the Time-Series Database.
4. Log Collector aggregates logs from microservices and forwards them.
5. Logs are indexed and stored in the Log Storage system.
6. Tracing Collector receives trace spans with context propagation.
7. Trace data is stored in Trace Storage for querying and visualization.
8. Visualization Dashboards query metrics, logs, and traces for display.
9. Alerting System monitors metrics and triggers alerts when thresholds are breached.
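Step 6 depends on each service passing trace context to the next one. One common way to do this is the W3C Trace Context `traceparent` HTTP header, whose value has the form `version-traceid-parentid-flags`. The sketch below is a minimal illustration of that idea and skips validation; the function names are hypothetical, and a real service would normally rely on an instrumentation library such as OpenTelemetry instead.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace: 32 hex chars of trace ID, 16 hex chars of span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"  # version 00, sampled flag set


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace ID, but mint a new span ID for this hop."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers: dict) -> dict:
    """Headers to attach to a downstream call (hypothetical helper)."""
    incoming = incoming_headers.get("traceparent")
    value = child_traceparent(incoming) if incoming else new_traceparent()
    return {"traceparent": value}
```

Because the trace ID survives every hop while the span ID changes, the collector can stitch spans from different services into one trace.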
Database Schema
Entities:
  • Metric: {timestamp, service_id, metric_name, value, labels}
  • LogEntry: {timestamp, service_id, log_level, message, trace_id, span_id, metadata}
  • TraceSpan: {trace_id, span_id, parent_span_id, service_id, operation_name, start_time, duration, tags}
Relationships:
  • Logs and TraceSpans link via trace_id for correlation.
  • Metrics are tagged by service_id and labels for filtering.
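To make the log-to-trace relationship concrete, here is a minimal in-memory sketch of two of the entities and the trace_id join. The field names follow the schema above; the Python types and the `logs_for_trace` helper are assumptions added for illustration, not part of any particular storage engine.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LogEntry:
    timestamp: float
    service_id: str
    log_level: str
    message: str
    trace_id: Optional[str] = None   # absent when no trace context was available
    span_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)


@dataclass
class TraceSpan:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]    # None for the root span
    service_id: str
    operation_name: str
    start_time: float
    duration: float
    tags: dict = field(default_factory=dict)


def logs_for_trace(logs: list, trace_id: str) -> list:
    """Return the log entries that belong to one trace, oldest first."""
    matching = [entry for entry in logs if entry.trace_id == trace_id]
    return sorted(matching, key=lambda entry: entry.timestamp)
```

In a real deployment this lookup would be a query against the log store with trace_id as an indexed field rather than a scan over a list.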
Scaling Discussion
Bottlenecks
High ingestion rate causing overload on collectors and storage
Storage size growth leading to increased query latency
Trace data volume causing slow trace retrieval
Dashboard query load impacting system responsiveness
Solutions
Use horizontal scaling and sharding for collectors and storage clusters
Implement data downsampling and aggregation for older metrics
Apply sampling strategies for traces to reduce volume
Use caching layers and query optimization for dashboards
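Two of these solutions are easy to sketch. The first function below shows deterministic trace sampling keyed on the trace ID, so every service reaches the same keep-or-drop decision for a given trace. The second shows downsampling of older metrics by averaging points into fixed-width time buckets. Both are simplified illustrations with hypothetical function names, and the choice of hash and of averaging as the aggregate are assumptions.

```python
import hashlib
from collections import defaultdict
from statistics import mean


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Keep roughly sample_rate of traces, consistently per trace_id."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Three raw points collapse into two one-minute buckets.
print(downsample([(0, 1.0), (30, 3.0), (60, 5.0)], 60))
```

Averaging hides spikes, so a production setup would often keep min, max, and count alongside the mean for each bucket.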
Interview Tips
Time: Spend 10 minutes clarifying requirements and constraints, 20 minutes designing the architecture and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Explain the importance of the three pillars for observability
Discuss data collection methods and protocols
Describe storage choices optimized for each data type
Highlight how data flows from microservices to visualization
Address scaling challenges and mitigation strategies
Mention security and data retention considerations

Practice

1. Which of the following best describes the role of metrics in microservices monitoring?
easy
A. They track the path of a request through multiple services.
B. They record detailed events and errors in the system.
C. They provide numerical data about system performance over time.
D. They store configuration settings for microservices.

Solution

  1. Step 1: Understand what metrics represent

    Metrics are numerical measurements like CPU usage, request counts, or latency that show system health over time.
  2. Step 2: Differentiate metrics from logs and traces

    Logs record events, traces follow request paths, but metrics summarize performance data.
  3. Final Answer:

    They provide numerical data about system performance over time. -> Option C
  4. Quick Check:

    Metrics = numerical performance data [OK]
Hint: Metrics = numbers about performance, not events or paths [OK]
Common Mistakes:
  • Confusing metrics with logs as event records
  • Thinking traces are numerical data
  • Assuming metrics store configurations
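As a small illustration of metrics being numbers aggregated over time, the sketch below keeps a request counter and computes a latency percentile with the nearest-rank method. The names are hypothetical and the data lives in process memory; a real service would export such values through a metrics client library.

```python
import math
from collections import Counter

request_count = Counter()   # requests seen per service
latencies_ms = []           # raw latency observations


def record_request(service: str, latency_ms: float) -> None:
    """Record one handled request and its latency."""
    request_count[service] += 1
    latencies_ms.append(latency_ms)


def percentile(values, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


for latency in range(1, 101):          # simulated latencies of 1..100 ms
    record_request("checkout", latency)

print(request_count["checkout"], percentile(latencies_ms, 99))
```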
2. Which syntax correctly represents a log entry in a microservice system?
easy
A. [2024-06-01 12:00:00] ERROR Failed to connect
B. {"timestamp": "2024-06-01T12:00:00Z", "level": "ERROR", "message": "Failed to connect"}
C. Failed to connect
D. ERROR 2024-06-01T12:00:00Z Failed to connect

Solution

  1. Step 1: Identify standard log formats

    JSON format is widely used for structured logs in microservices for easy parsing and querying.
  2. Step 2: Compare options for correctness

    {"timestamp": "2024-06-01T12:00:00Z", "level": "ERROR", "message": "Failed to connect"} is a valid JSON log entry with timestamp, level, and message fields. Others are less structured or not JSON.
  3. Final Answer:

    {"timestamp": "2024-06-01T12:00:00Z", "level": "ERROR", "message": "Failed to connect"} -> Option B
  4. Quick Check:

    Structured JSON logs = {"timestamp": "2024-06-01T12:00:00Z", "level": "ERROR", "message": "Failed to connect"} [OK]
Hint: Logs are best as structured JSON for easy use [OK]
Common Mistakes:
  • Using unstructured plain text logs
  • Confusing bracketed plain-text logs with JSON
  • Ignoring timestamp or level fields
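A log line in the shape of option B can be produced with Python's standard logging module and a custom formatter. This is a minimal sketch: the `JsonFormatter` class and the "orders" logger name are hypothetical, and only the three fields from the question are emitted.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")   # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Failed to connect")
```

Because every line is a self-contained JSON object, a log collector can parse fields directly instead of relying on fragile regular expressions.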
3. Given this trace data snippet for a request through three microservices, what is the total time spent processing the request?
{
  "traceId": "abc123",
  "spans": [
    {"service": "A", "duration_ms": 50},
    {"service": "B", "duration_ms": 30},
    {"service": "C", "duration_ms": 20}
  ]
}
medium
A. 100 ms
B. 50 ms
C. 30 ms
D. 20 ms

Solution

  1. Step 1: Understand trace spans and durations

    Each span shows time spent in a service. The total time is the sum of the span durations only if the services run sequentially, which this question assumes.
  2. Step 2: Sum durations of all spans

    50 ms + 30 ms + 20 ms = 100 ms total processing time.
  3. Final Answer:

    100 ms -> Option A
  4. Quick Check:

    Sum spans durations = 100 ms [OK]
Hint: Add all span durations for total trace time when the spans run one after another [OK]
Common Mistakes:
  • Taking only the longest span as total time
  • Ignoring some spans in calculation
  • Confusing traceId with duration
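The sum used in this answer holds when the spans run one after another. The sketch below computes that sum for the snippet in the question, then contrasts it with a wall-clock calculation for spans that overlap. The `start_ms` field and the second data set are hypothetical additions for illustration; the question's snippet carries only durations.

```python
spans = [
    {"service": "A", "duration_ms": 50},
    {"service": "B", "duration_ms": 30},
    {"service": "C", "duration_ms": 20},
]


def sequential_total_ms(spans) -> int:
    """Total time when each span starts after the previous one ends."""
    return sum(span["duration_ms"] for span in spans)


def wall_clock_ms(spans) -> int:
    """Elapsed time from the earliest start to the latest end (needs start_ms)."""
    start = min(span["start_ms"] for span in spans)
    end = max(span["start_ms"] + span["duration_ms"] for span in spans)
    return end - start


# Hypothetical timings: B and C run in parallel once A has finished.
parallel_spans = [
    {"service": "A", "start_ms": 0, "duration_ms": 50},
    {"service": "B", "start_ms": 50, "duration_ms": 30},
    {"service": "C", "start_ms": 50, "duration_ms": 20},
]

print(sequential_total_ms(spans))     # 100
print(wall_clock_ms(parallel_spans))  # 80
```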
4. A developer notices that logs are missing trace IDs in a microservices system. What is the most likely cause?
medium
A. Services are using different programming languages.
B. Metrics collection is disabled.
C. Logs are stored in a different database.
D. Trace context is not propagated between services.

Solution

  1. Step 1: Understand trace ID propagation

    Trace IDs must be passed along service calls to link logs and traces.
  2. Step 2: Identify cause of missing trace IDs

    If trace context is not propagated, logs won't have trace IDs, breaking trace-log correlation.
  3. Final Answer:

    Trace context is not propagated between services. -> Option D
  4. Quick Check:

    Missing trace IDs = missing context propagation [OK]
Hint: Trace IDs must flow between services to appear in logs [OK]
Common Mistakes:
  • Confusing metrics with trace IDs
  • Assuming storage location causes missing IDs
  • Blaming programming language differences
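Once trace context does arrive, each service still has to attach the trace ID to its log records. One way to sketch this in Python is a context variable plus a logging filter. The names below (`current_trace_id`, `TraceIdFilter`, `handle_request`, the "payments" logger) are hypothetical, and the header parsing assumes the W3C `traceparent` layout of `version-traceid-parentid-flags` without validating it.

```python
import contextvars
import logging

# Holds the trace ID for the request currently being handled; "-" means none.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger = logging.getLogger("payments")   # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def handle_request(headers: dict) -> None:
    """Extract the trace ID from the incoming header, then log as usual."""
    parts = headers.get("traceparent", "").split("-")
    current_trace_id.set(parts[1] if len(parts) == 4 else "-")
    logger.info("charge accepted")
```

If an upstream service drops the header, the log line shows `trace_id=-`, which is exactly the symptom described in the question.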
5. You are designing a microservices system and want to implement the three pillars: metrics, logs, and traces. Which approach best ensures scalability and effective monitoring?
hard
A. Use a centralized monitoring system that collects metrics via Prometheus, logs via ELK stack, and traces via OpenTelemetry.
B. Store all logs and traces locally on each service to reduce network overhead.
C. Only collect metrics and ignore logs and traces to save storage space.
D. Send all raw logs and traces directly to the client application for analysis.

Solution

  1. Step 1: Identify best practices for scalable monitoring

    Centralized systems like Prometheus for metrics, ELK for logs, and OpenTelemetry for traces are industry standards for scalability and analysis.
  2. Step 2: Evaluate options for scalability and effectiveness

    Local storage limits analysis and scalability; ignoring logs/traces loses insights; sending raw data to clients is inefficient and insecure.
  3. Final Answer:

    Use a centralized monitoring system that collects metrics via Prometheus, logs via ELK stack, and traces via OpenTelemetry. -> Option A
  4. Quick Check:

    Centralized, specialized tools = scalable monitoring [OK]
Hint: Centralize collection with proven tools for all three pillars [OK]
Common Mistakes:
  • Storing logs/traces locally only
  • Ignoring logs or traces
  • Sending raw data directly to clients