Distributed tracing (Jaeger, Zipkin) in Microservices - System Design Exercise

Design: Distributed Tracing System for Microservices
Design the tracing collection, storage, and visualization system. Instrumentation libraries and microservice code changes are out of scope.
Functional Requirements
FR1: Trace requests as they flow through multiple microservices
FR2: Collect timing and metadata for each service call
FR3: Visualize traces to identify latency and errors
FR4: Support high throughput with minimal overhead
FR5: Allow querying traces by trace ID, service, or time range
FR6: Integrate with existing microservices without major code changes
Non-Functional Requirements
NFR1: Handle up to 100,000 traces per second
NFR2: End-to-end trace latency under 500ms for visualization
NFR3: 99.9% system availability
NFR4: Minimal impact on microservice performance (less than 5% overhead)
NFR5: Data retention for 7 days
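A rough capacity estimate helps size the storage backend for these targets. The sketch below is only illustrative: the spans-per-trace, bytes-per-span, and sampling-rate figures are assumptions, not part of the requirements, and should be replaced with measured values.

```python
# Back-of-envelope storage estimate for the tracing backend.
# ASSUMPTIONS (not from the requirements): spans per trace, bytes per span,
# and the sampling rate are illustrative placeholders.

TRACES_PER_SECOND = 100_000   # from NFR1
RETENTION_DAYS = 7            # from NFR5
SPANS_PER_TRACE = 10          # assumed
BYTES_PER_SPAN = 500          # assumed, before replication and indexing
SAMPLING_RATE = 0.01          # assumed: keep 1% of traces

SECONDS_PER_DAY = 86_400


def estimate_storage_bytes(sampling_rate: float) -> float:
    """Raw bytes retained over the full retention window."""
    spans_per_second = TRACES_PER_SECOND * SPANS_PER_TRACE * sampling_rate
    return spans_per_second * BYTES_PER_SPAN * SECONDS_PER_DAY * RETENTION_DAYS


def to_terabytes(num_bytes: float) -> float:
    """Convert bytes to decimal terabytes."""
    return num_bytes / 1e12


if __name__ == "__main__":
    print(f"unsampled: {to_terabytes(estimate_storage_bytes(1.0)):.1f} TB")
    print(f"sampled:   {to_terabytes(estimate_storage_bytes(SAMPLING_RATE)):.2f} TB")
```

Under these assumed figures, storing every trace needs roughly 302 TB of raw span data over 7 days, while a 1% sample needs about 3 TB. The gap is why sampling usually enters the design early.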
Think Before You Design
Key Components
Instrumentation libraries for trace context and span creation
Trace collector agents or gateways
Storage backend (e.g., Elasticsearch, Cassandra)
Query API and UI for trace visualization
Sampling strategies to reduce data volume
Context propagation mechanisms (HTTP headers, gRPC metadata)
Design Patterns
Context propagation pattern
Sampling pattern (head-based, tail-based)
Fan-out and aggregation of trace data
Asynchronous data ingestion
Data retention and archival
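The head-based sampling pattern can be sketched as a decision made once, when the trace starts, and derived from the trace ID so that every service reaches the same verdict. This is a minimal illustration rather than the algorithm of any particular tracer; `new_trace_id` and `should_sample` are hypothetical names.

```python
import secrets


def new_trace_id() -> str:
    """Random 128-bit trace ID as 32 lowercase hex characters."""
    return secrets.token_hex(16)


def should_sample(trace_id: str, rate: float) -> bool:
    """Head-based probabilistic decision derived from the trace ID.

    The decision depends only on the trace ID, so any service that sees
    the same ID makes the same choice without coordination.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Treat the low 64 bits of the ID as a uniform number in [0, 2**64).
    bucket = int(trace_id, 16) & (2**64 - 1)
    return bucket < rate * 2**64


if __name__ == "__main__":
    ids = [new_trace_id() for _ in range(10_000)]
    kept = sum(should_sample(t, 0.1) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

Tail-based sampling differs in that the keep-or-drop decision is deferred until the whole trace has been collected, which allows keeping slow or failed traces at the cost of buffering spans in the collector.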
Reference Architecture
  +----------------+       +----------------+       +----------------+
  | Microservice A |-----> | Microservice B |-----> | Microservice C |
  +----------------+       +----------------+       +----------------+
          |                        |                        |
          |                        |                        |
          v                        v                        v
  +---------------------------------------------------------------+
  |                   Instrumentation Libraries                   |
  +---------------------------------------------------------------+
                                  |
                                  v
                    +---------------------------+
                    |    Trace Collector(s)     |
                    | (Jaeger/Zipkin Collector) |
                    +---------------------------+
                                  |
                                  v
                    +---------------------------+
                    |      Storage Backend      |
                    | (Elasticsearch/Cassandra) |
                    +---------------------------+
                                  |
                                  v
                    +---------------------------+
                    |      Query API & UI       |
                    |    (Jaeger/Zipkin UI)     |
                    +---------------------------+
Components
Instrumentation Libraries (OpenTelemetry SDKs): Automatically create and propagate trace context and spans in microservices
Trace Collector (Jaeger Collector or Zipkin Collector): Receive trace data from services and forward it to storage
Storage Backend (Elasticsearch or Cassandra): Store trace and span data for querying and retention
Query API and UI (Jaeger UI or Zipkin UI): Allow users to search, view, and analyze traces
Request Flow
1. A user request enters Microservice A, and the instrumentation library creates a new trace and root span.
2. Trace context is injected into outgoing requests to Microservice B.
3. Microservice B extracts the trace context and creates child spans for its operations.
4. This continues through Microservice C and any further services, each adding spans.
5. Instrumentation libraries send spans asynchronously to the Trace Collector.
6. The Trace Collector batches spans and stores them in the Storage Backend.
7. A user queries the Query API/UI to retrieve and visualize traces by trace ID or filters.
8. The UI displays the trace timeline, showing latency and errors across services.
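Steps 1 to 4 of the request flow can be sketched with plain dictionaries standing in for HTTP headers. The header names follow the B3 format; `SpanContext`, `start_span`, and `inject` are hypothetical helpers. The sketch simplifies by treating the incoming span ID as the parent, and it omits other B3 headers such as `X-B3-ParentSpanId` and `X-B3-Sampled`.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional

TRACE_HEADER = "X-B3-TraceId"
SPAN_HEADER = "X-B3-SpanId"


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None


def start_span(incoming_headers: Dict[str, str]) -> SpanContext:
    """Continue the caller's trace if headers are present, else start a new one.

    Simplification: real HTTP header names are case-insensitive, while this
    dictionary lookup is not.
    """
    trace_id = incoming_headers.get(TRACE_HEADER)
    parent_span_id = incoming_headers.get(SPAN_HEADER)
    if trace_id is None:
        # No upstream context: this service becomes the root of a new trace.
        trace_id = secrets.token_hex(16)
        parent_span_id = None
    return SpanContext(trace_id, secrets.token_hex(8), parent_span_id)


def inject(ctx: SpanContext, outgoing_headers: Dict[str, str]) -> Dict[str, str]:
    """Copy the current span's identity into the headers for the next hop."""
    headers = dict(outgoing_headers)
    headers[TRACE_HEADER] = ctx.trace_id
    headers[SPAN_HEADER] = ctx.span_id
    return headers


if __name__ == "__main__":
    span_a = start_span({})                    # step 1: root span in service A
    headers_to_b = inject(span_a, {})          # step 2: inject into call to B
    span_b = start_span(headers_to_b)          # step 3: B extracts, creates child
    span_c = start_span(inject(span_b, {}))    # step 4: same again for C
    print(span_a, span_b, span_c, sep="\n")
```

If a service forgets to call `inject`, the next hop receives no headers and `start_span` begins an unrelated trace, so the original trace appears to be missing that service's spans.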
Database Schema
Entities:
- Trace: unique trace ID, start time, end time, status
- Span: span ID, parent span ID, trace ID, service name, operation name, start time, duration, tags, logs
Relationships:
- One Trace has many Spans (1:N)
- Spans are linked by parent span ID to form a tree representing the call hierarchy
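To show how the parent span ID turns a flat list of spans into a call tree, here is a small sketch. The `Span` fields mirror the schema above (logs are left out for brevity); the service and operation names in the example are made up.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    trace_id: str
    parent_span_id: Optional[str]
    service_name: str
    operation_name: str
    start_ms: int
    duration_ms: int
    tags: Dict[str, str] = field(default_factory=dict)


def children_by_parent(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Index spans by parent ID; root spans are stored under the key None."""
    index: Dict[Optional[str], List[Span]] = defaultdict(list)
    for span in sorted(spans, key=lambda s: s.start_ms):
        index[span.parent_span_id].append(span)
    return index


def render_tree(spans: List[Span]) -> List[str]:
    """Return one indented line per span, depth-first from the root."""
    index = children_by_parent(spans)
    lines: List[str] = []

    def visit(span: Span, depth: int) -> None:
        label = f"{span.service_name}:{span.operation_name} ({span.duration_ms}ms)"
        lines.append("  " * depth + label)
        for child in index.get(span.span_id, []):
            visit(child, depth + 1)

    for root in index.get(None, []):
        visit(root, 0)
    return lines


if __name__ == "__main__":
    example = [
        Span("c", "t1", "a", "payments", "charge", 35, 10),
        Span("a", "t1", None, "gateway", "GET /checkout", 0, 50),
        Span("b", "t1", "a", "orders", "create", 10, 20),
    ]
    print("\n".join(render_tree(example)))
```

Spans often arrive out of order, which is why the sketch sorts by start time before indexing rather than relying on insertion order.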
Scaling Discussion
Bottlenecks
High volume of trace data causing storage overload
Trace Collector becoming a bottleneck under heavy load
Query latency increasing with large data size
Network overhead from trace data transmission
Instrumentation overhead impacting microservice performance
Solutions
Implement sampling strategies to reduce trace volume (e.g., probabilistic sampling)
Scale Trace Collectors horizontally with load balancing
Use scalable storage solutions optimized for time-series and search (Elasticsearch clusters, Cassandra)
Compress trace data and batch transmissions to reduce network load
Optimize instrumentation to minimize synchronous calls and use asynchronous reporting
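The batching and asynchronous-reporting ideas above might look like the following on the service side. This is a sketch under assumptions: `BatchingReporter` and its parameters are hypothetical, and `send` stands in for whatever transport ships spans to the collector.

```python
import queue
import threading
from typing import Callable, List


class BatchingReporter:
    """Buffers finished spans and ships them in batches from a background thread."""

    def __init__(self, send: Callable[[List[dict]], None], max_batch: int = 100,
                 flush_interval_s: float = 1.0, max_queue: int = 10_000) -> None:
        self._send = send
        self._max_batch = max_batch
        self._flush_interval_s = flush_interval_s
        self._queue = queue.Queue(maxsize=max_queue)
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, span: dict) -> bool:
        """Called on the request path: never blocks; drops the span if the buffer is full."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            return False

    def _drain(self) -> List[dict]:
        """Take up to max_batch spans off the queue without waiting."""
        batch: List[dict] = []
        while len(batch) < self._max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _flush(self) -> None:
        """Send everything currently buffered, in chunks of at most max_batch."""
        while True:
            batch = self._drain()
            if not batch:
                return
            self._send(batch)

    def _run(self) -> None:
        while not self._stop.is_set():
            # Sleep until the next flush tick, or wake early on close().
            self._stop.wait(self._flush_interval_s)
            self._flush()

    def close(self) -> None:
        """Stop the worker and flush whatever is still buffered."""
        self._stop.set()
        self._worker.join()
        self._flush()
```

Dropping spans when the buffer is full is a deliberate choice here: losing some telemetry is usually preferable to slowing down or blocking the request being traced.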
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing components and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Explain how trace context propagates across microservices
Discuss trade-offs of sampling vs full tracing
Describe storage choices and query patterns
Highlight how visualization helps identify latency and errors
Address performance impact and how to minimize it
Show awareness of scaling challenges and solutions

Practice

1. What is the main purpose of distributed tracing tools like Jaeger or Zipkin in microservices?
easy
A. To track and visualize requests as they flow through multiple services
B. To store large amounts of user data securely
C. To replace load balancers in service communication
D. To encrypt network traffic between microservices

Solution

  1. Step 1: Understand the role of distributed tracing

    Distributed tracing tools help monitor how requests move through different microservices by collecting timing and metadata.
  2. Step 2: Identify the main function of Jaeger and Zipkin

    They visualize and analyze traces made of spans to find bottlenecks or errors in service chains.
  3. Final Answer:

    To track and visualize requests as they flow through multiple services -> Option A
  4. Quick Check:

    Distributed tracing = track requests flow [OK]
Hint: Distributed tracing = tracking requests across services [OK]
Common Mistakes:
  • Confusing tracing with data storage
  • Thinking tracing replaces load balancers
  • Assuming tracing encrypts traffic
2. Which of the following is the correct way to propagate trace context between microservices using HTTP headers?
easy
A. Add Cookie header with span ID
B. Add Authorization header with trace ID
C. Add X-B3-TraceId and X-B3-SpanId headers to the outgoing request
D. Add Content-Type header with trace ID value

Solution

  1. Step 1: Recall standard trace context headers

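    Zipkin's B3 propagation format uses dedicated headers such as X-B3-TraceId and X-B3-SpanId to pass trace information between services. (The W3C Trace Context standard uses a single traceparent header for the same purpose.)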
  2. Step 2: Identify correct header usage

    Headers like Authorization, Content-Type, or Cookie are unrelated to tracing context propagation.
  3. Final Answer:

    Add X-B3-TraceId and X-B3-SpanId headers to the outgoing request -> Option C
  4. Quick Check:

    Trace context headers = X-B3-TraceId, X-B3-SpanId [OK]
Hint: Trace context uses X-B3 headers, not auth or content-type [OK]
Common Mistakes:
  • Using unrelated HTTP headers for trace context
  • Forgetting to propagate span ID
  • Confusing trace ID with authentication tokens
3. Given the following trace spans collected by Zipkin, what is the total time taken by the root request?
Span A (root): start=0ms, duration=50ms
Span B (child of A): start=10ms, duration=20ms
Span C (child of A): start=35ms, duration=10ms
medium
A. 50ms
B. 40ms
C. 30ms
D. 60ms

Solution

  1. Step 1: Understand root span duration

    The root span duration represents the total time of the entire request, including child spans.
  2. Step 2: Analyze given spans

    Span A starts at 0ms and lasts 50ms, so total time is 50ms regardless of child spans.
  3. Final Answer:

    50ms -> Option A
  4. Quick Check:

    Root span duration = total request time = 50ms [OK]
Hint: Root span duration = total request time [OK]
Common Mistakes:
  • Adding up child span durations instead of reading the root span's duration
  • Ignoring root span duration
  • Confusing start times with total duration
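The arithmetic behind this answer can be checked with a few lines of code. The span values come from the question; the self-time calculation is an extra illustration and assumes the child spans run entirely inside the root span's window.

```python
from typing import List, Tuple

# (name, start_ms, duration_ms), taken from the question
ROOT = ("A", 0, 50)
CHILDREN = [("B", 10, 20), ("C", 35, 10)]


def covered_ms(intervals: List[Tuple[int, int]]) -> int:
    """Total length of the union of (start, end) intervals, merging overlaps."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close out the previous merged run.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


total_ms = ROOT[2]  # the root span's duration is the request time
child_ms = covered_ms([(start, start + dur) for _, start, dur in CHILDREN])
self_ms = total_ms - child_ms  # time spent in A outside of B and C

if __name__ == "__main__":
    print(f"total={total_ms}ms children={child_ms}ms self={self_ms}ms")
```

The total stays 50ms; the children account for 30ms of it, leaving 20ms spent in the root service itself.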
4. You notice that your distributed tracing data in Jaeger shows many missing spans for some services. What is the most likely cause?
medium
A. The network latency is too low
B. The services have too many CPU cores
C. The database is down
D. The services are not propagating the trace context headers correctly

Solution

  1. Step 1: Identify cause of missing spans

    If spans are missing, it usually means trace context was not passed properly between services.
  2. Step 2: Eliminate unrelated causes

    CPU cores, database status, or low network latency do not cause missing trace spans.
  3. Final Answer:

    The services are not propagating the trace context headers correctly -> Option D
  4. Quick Check:

    Missing spans = trace context not propagated [OK]
Hint: Missing spans? Check trace context propagation [OK]
Common Mistakes:
  • Blaming unrelated system resources
  • Ignoring header propagation
  • Assuming network latency causes missing spans
5. You want to design a distributed tracing system for a microservices architecture with 100 services and high request volume. Which approach best ensures scalability and minimal overhead?
hard
A. Trace every request fully and store all spans in a single central database
B. Use sampling to trace only a subset of requests and propagate trace context with lightweight headers
C. Disable trace context propagation and log spans locally in each service
D. Use synchronous calls to the tracing backend for every span creation

Solution

  1. Step 1: Consider scalability needs

    Tracing every request fully in a large system causes high overhead and storage issues.
  2. Step 2: Identify best practice for high volume tracing

    Sampling reduces load by tracing only some requests, and lightweight headers keep propagation efficient.
  3. Step 3: Eliminate poor options

    Disabling propagation loses trace linkage; synchronous calls add latency; central DB can bottleneck.
  4. Final Answer:

    Use sampling to trace only a subset of requests and propagate trace context with lightweight headers -> Option B
  5. Quick Check:

    Sampling + lightweight headers = scalable tracing [OK]
Hint: Sampling + lightweight headers = scalable tracing [OK]
Common Mistakes:
  • Tracing all requests causing overhead
  • Ignoring trace context propagation
  • Using synchronous calls causing latency