Distributed tracing (Jaeger, Zipkin) in Microservices - System Design Exercise

Design: Distributed Tracing System for Microservices

Design the tracing collection, storage, and visualization system. Instrumentation libraries and microservice code changes are out of scope.

Functional Requirements

FR1: Trace requests as they flow through multiple microservices

FR2: Collect timing and metadata for each service call

FR3: Visualize traces to identify latency and errors

FR4: Support high throughput with minimal overhead

FR5: Allow querying traces by trace ID, service, or time range

FR6: Integrate with existing microservices without major code changes

Non-Functional Requirements

NFR1: Handle up to 100,000 traces per second

NFR2: End-to-end trace latency under 500ms for visualization

NFR3: 99.9% system availability

NFR4: Minimal impact on microservice performance (less than 5% overhead)

NFR5: Data retention for 7 days
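
A rough storage estimate helps show what NFR1 and NFR5 imply together. The sketch below is back-of-the-envelope arithmetic only: the spans-per-trace count, bytes-per-span figure, and 1% sampling rate are illustrative assumptions, not values given in the requirements.

```python
# Back-of-the-envelope sizing for NFR1 (100,000 traces/s) and NFR5 (7 days).
# ASSUMPTIONS (not from the requirements): spans per trace, bytes per span,
# and the sampling rate are illustrative placeholders.

TRACES_PER_SECOND = 100_000   # NFR1
RETENTION_DAYS = 7            # NFR5
SPANS_PER_TRACE = 10          # assumption
BYTES_PER_SPAN = 500          # assumption
SAMPLING_RATE = 0.01          # assumption: keep 1% of traces

SECONDS_PER_DAY = 86_400


def stored_bytes(sampling_rate: float) -> float:
    """Raw bytes retained over the retention window, before replication or compression."""
    spans_per_second = TRACES_PER_SECOND * SPANS_PER_TRACE * sampling_rate
    return spans_per_second * BYTES_PER_SPAN * SECONDS_PER_DAY * RETENTION_DAYS


def to_terabytes(num_bytes: float) -> float:
    """Convert bytes to decimal terabytes (10^12 bytes)."""
    return num_bytes / 1e12


if __name__ == "__main__":
    print(f"unsampled: {to_terabytes(stored_bytes(1.0)):.1f} TB")
    print(f"sampled:   {to_terabytes(stored_bytes(SAMPLING_RATE)):.2f} TB")
```

Under these assumptions, storing every trace needs roughly 302 TB of raw data over 7 days, while 1% sampling brings that to about 3 TB. This is why sampling appears among the key components.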

Think Before You Design

Key Components

Instrumentation libraries for trace context and span creation

Trace collector agents or gateways

Storage backend (e.g., Elasticsearch, Cassandra)

Query API and UI for trace visualization

Sampling strategies to reduce data volume

Context propagation mechanisms (HTTP headers, gRPC metadata)

Design Patterns

Context propagation pattern

Sampling pattern (head-based, tail-based)

Fan-out and aggregation of trace data

Asynchronous data ingestion

Data retention and archival
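
To make the sampling pattern concrete, here is a minimal sketch of both variants. The function names, the span dictionary keys (`error`, `duration_ms`), and the use of SHA-256 for bucketing are assumptions made for illustration; Jaeger and Zipkin have their own sampler implementations.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based sampling: decide when the trace starts.

    The decision is derived deterministically from the trace ID, so every
    service that sees the same trace ID reaches the same decision.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_sample(spans: list[dict], latency_threshold_ms: float) -> bool:
    """Tail-based sampling: decide after the whole trace has been collected.

    Keeps the trace if any span reported an error or exceeded the threshold.
    """
    return any(
        span.get("error", False) or span["duration_ms"] > latency_threshold_ms
        for span in spans
    )
```

Head-based sampling is cheap because unsampled traces are never reported, but it cannot know in advance which traces will be slow or fail. Tail-based sampling can keep exactly those traces, at the cost of buffering all spans of a trace until the decision is made.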

Reference Architecture

  +----------------+       +----------------+       +----------------+
  | Microservice A |-----> | Microservice B |-----> | Microservice C |
  +----------------+       +----------------+       +----------------+
          |                        |                        |
          |                        |                        |
          v                        v                        v
  +---------------------------------------------------------------+
  |                      Instrumentation Libraries                |
  +---------------------------------------------------------------+
                                |
                                v
                  +---------------------------+
                  |    Trace Collector(s)     |
                  | (Jaeger/Zipkin Collector) |
                  +---------------------------+
                                |
                                v
                  +---------------------------+
                  |      Storage Backend      |
                  | (Elasticsearch/Cassandra) |
                  +---------------------------+
                                |
                                v
                  +---------------------------+
                  |      Query API & UI       |
                  |    (Jaeger/Zipkin UI)     |
                  +---------------------------+

Components

Instrumentation Libraries (OpenTelemetry SDKs): Automatically create and propagate trace context and spans in microservices.

Trace Collector (Jaeger Collector or Zipkin Collector): Receives trace data from services and forwards it to storage.

Storage Backend (Elasticsearch or Cassandra): Stores trace and span data for querying and retention.

Query API and UI (Jaeger UI or Zipkin UI): Lets users search, view, and analyze traces.
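
The Query API's filtering role (by trace ID, service, or time range, per FR5) might look like the following in-memory sketch. The function `find_traces` and the span keys (`trace_id`, `service`, `start_us`) are hypothetical; a real deployment would push these filters down to Elasticsearch or Cassandra rather than scanning spans in Python.

```python
from typing import Iterable, Optional


def find_traces(
    spans: Iterable[dict],
    *,
    trace_id: Optional[str] = None,
    service: Optional[str] = None,
    start_us: Optional[int] = None,
    end_us: Optional[int] = None,
) -> list[str]:
    """Return IDs of traces with at least one span matching every given filter.

    The time range is half-open: start_us <= span start < end_us.
    """
    matched: list[str] = []
    seen: set[str] = set()
    for span in spans:
        if trace_id is not None and span["trace_id"] != trace_id:
            continue
        if service is not None and span["service"] != service:
            continue
        if start_us is not None and span["start_us"] < start_us:
            continue
        if end_us is not None and span["start_us"] >= end_us:
            continue
        if span["trace_id"] not in seen:
            seen.add(span["trace_id"])
            matched.append(span["trace_id"])
    return matched
```

A lookup by trace ID is a point query, while service and time-range filters are index scans. This is one reason the storage choice depends on the expected query patterns.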

Request Flow

1. A user request enters Microservice A, and the instrumentation library creates a new trace and a root span.

2. The trace context is injected into outgoing requests to Microservice B.

3. Microservice B extracts the trace context and creates child spans for its operations.

4. This continues through Microservice C and any further services, each adding its own spans.

5. The instrumentation libraries send spans asynchronously to the Trace Collector.

6. The Trace Collector batches spans and writes them to the Storage Backend.

7. A user queries the Query API/UI to retrieve and visualize traces by trace ID or other filters.

8. The UI displays the trace timeline, showing latency and errors across services.

Database Schema

Entities:

- Trace: unique trace ID, start time, end time, status
- Span: span ID, parent span ID, trace ID, service name, operation name, start time, duration, tags, logs

Relationships:

- One Trace has many Spans (1:N)
- Spans are linked by parent span ID to form a tree representing the call hierarchy
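
As an illustration of the Span entity and the parent-link relationship, the following sketch models a span as a dataclass and rebuilds the call tree from parent span IDs. The field names, the microsecond units, and the `build_tree` and `render` helpers are assumptions made for this example, not the actual Jaeger or Zipkin storage schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    trace_id: str
    parent_span_id: Optional[str]  # None for the root span
    service_name: str
    operation_name: str
    start_time_us: int             # assumption: microseconds since epoch
    duration_us: int
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)


def build_tree(spans: list[Span]) -> dict[Optional[str], list[Span]]:
    """Group spans by parent span ID; the root span is stored under key None."""
    children: dict[Optional[str], list[Span]] = {}
    for span in sorted(spans, key=lambda s: s.start_time_us):
        children.setdefault(span.parent_span_id, []).append(span)
    return children


def render(children: dict, parent_id: Optional[str] = None, depth: int = 0) -> list[str]:
    """Depth-first walk producing one indented line per span."""
    lines: list[str] = []
    for span in children.get(parent_id, []):
        indent = "  " * depth
        lines.append(f"{indent}{span.service_name}:{span.operation_name} ({span.duration_us} us)")
        lines.extend(render(children, span.span_id, depth + 1))
    return lines
```

Calling `render(build_tree(spans))` on the spans of one trace yields an indented outline of the call hierarchy, which is essentially the data a trace timeline view is drawn from.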

Scaling Discussion

Bottlenecks

High volume of trace data causing storage overload

Trace Collector becoming a bottleneck under heavy load

Query latency increasing with large data size

Network overhead from trace data transmission

Instrumentation overhead impacting microservice performance

Solutions

Implement sampling strategies to reduce trace volume (e.g., probabilistic sampling)

Scale Trace Collectors horizontally with load balancing

Use scalable storage solutions optimized for time-series and search (Elasticsearch clusters, Cassandra)

Compress trace data and batch transmissions to reduce network load

Optimize instrumentation to minimize synchronous calls and use asynchronous reporting
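
The batching and asynchronous-reporting ideas above can be sketched as a bounded buffer that never blocks the request path. The `BatchingReporter` class and its parameters are hypothetical. To keep the example deterministic, `flush()` is called explicitly here, whereas a real reporter would typically call it from a background thread on a timer.

```python
import queue
from typing import Callable


class BatchingReporter:
    """Buffers spans in a bounded queue and sends them in batches."""

    def __init__(self, send: Callable[[list], None], max_queue: int = 1000, batch_size: int = 100):
        self._send = send                       # e.g. a function that posts to the collector
        self._queue: queue.Queue = queue.Queue(maxsize=max_queue)
        self._batch_size = batch_size
        self.dropped = 0                        # spans discarded because the buffer was full

    def report(self, span) -> bool:
        """Called on the request path: never blocks; drops the span if the buffer is full."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def flush(self) -> int:
        """Called off the request path: drains the buffer and returns the number of batches sent."""
        batches = 0
        while True:
            batch = []
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            if not batch:
                return batches
            self._send(batch)
            batches += 1
```

Dropping spans when the buffer is full trades trace completeness for a bounded impact on the instrumented service, which is the priority NFR4 sets.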

Interview Tips

Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing components and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.

Explain how trace context propagates across microservices

Discuss trade-offs of sampling vs full tracing

Describe storage choices and query patterns

Highlight how visualization helps identify latency and errors

Address performance impact and how to minimize it

Show awareness of scaling challenges and solutions

Practice

1. (Easy) What is the main purpose of distributed tracing tools like Jaeger or Zipkin in microservices?

A. To track and visualize requests as they flow through multiple services

B. To store large amounts of user data securely

C. To replace load balancers in service communication

D. To encrypt network traffic between microservices

Solution (Question 1)

Step 1: Understand the role of distributed tracing

Step 2: Identify the main function of Jaeger and Zipkin

Final Answer: A. To track and visualize requests as they flow through multiple services

Solution (Question 2)

Step 1: Recall standard trace context headers

Step 2: Identify correct header usage

Solution (Question 3)

Step 1: Understand root span duration

Step 2: Analyze given spans

Solution (Question 4)

Step 1: Identify cause of missing spans

Step 2: Eliminate unrelated causes

Solution (Question 5)

Step 1: Consider scalability needs

Step 2: Identify best practice for high volume tracing

Step 3: Eliminate poor options