
Distributed tracing (Jaeger, Zipkin) in Microservices - System Design Guide

Problem Statement
When a user request passes through many microservices, it becomes nearly impossible to track where delays or errors occur. Without a clear way to follow the request path, debugging slow responses or failures turns into guesswork, increasing downtime and reducing reliability.
Solution
Distributed tracing assigns a unique identifier to each user request and tracks it as it flows through all microservices. Each service records timing and metadata about its part of the request, sending this data to a central system like Jaeger or Zipkin. This creates a complete, visual timeline of the request journey, making it easy to spot bottlenecks and errors.
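To make the idea concrete, here is a minimal sketch of the data each service might record. The `Span` class, its field names, and the `collected` list are hypothetical stand-ins for illustration, not the data model of Jaeger, Zipkin, or any particular SDK.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One service's record of its part of a request (hypothetical model)."""
    trace_id: str                     # shared by every span in the same request
    name: str                         # e.g. the service or operation name
    parent_id: Optional[str] = None   # span ID of the caller, if any
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    start: float = field(default_factory=time.time)
    duration: Optional[float] = None

    def finish(self) -> None:
        # Record how long this service spent on its part of the request.
        self.duration = time.time() - self.start


# Stand-in for the central backend: spans are simply appended to a list.
collected = []

trace_id = secrets.token_hex(16)      # unique identifier for one user request
root = Span(trace_id, "service-a")
child = Span(trace_id, "service-b", parent_id=root.span_id)
child.finish()
root.finish()
collected.extend([root, child])
```

Because both spans carry the same `trace_id` and the child points at its parent's `span_id`, a backend can reassemble them into a single timeline.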
Architecture
[Diagram: Client -> Service A (with Tracer SDK) -> Distributed Tracing Backend]

This diagram shows a user request flowing through multiple microservices, each instrumented with tracing SDKs that send timing and metadata to a central tracing backend like Jaeger or Zipkin.

Trade-offs
✓ Pros
Provides end-to-end visibility of requests across microservices.
Helps quickly identify performance bottlenecks and error sources.
Supports root cause analysis by showing detailed timing and metadata.
Integrates with existing microservices with minimal code changes using tracing SDKs.
✗ Cons
Adds overhead to each request due to tracing data collection and transmission.
Requires careful management of trace data storage to avoid high costs.
Complexity increases with many services and high request volumes, needing sampling strategies.
Use when your system has multiple microservices handling user requests and you need to diagnose latency or errors across service boundaries, especially once traffic reaches thousands of requests per second.
Avoid if your system is a simple monolith or has very low traffic (fewer than a few hundred requests per second), where the overhead and complexity of distributed tracing outweigh the benefits.
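The sampling strategies mentioned in the cons can be sketched as follows. This is one simple approach (a fixed-rate decision derived from the trace ID), shown under the assumption that trace IDs are random hex strings; the function name `should_sample` is hypothetical, and production tracers offer more sophisticated samplers.

```python
import secrets


def should_sample(trace_id: str, rate: float) -> bool:
    """Decide once per trace, from the trace ID alone, whether to record it.

    Because the decision depends only on the trace ID, every service that
    sees the same ID reaches the same answer without coordination.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Interpret the last 16 hex characters (64 bits) of the ID as a bucket.
    bucket = int(trace_id[-16:], 16)
    return bucket < rate * 2**64


# Roughly 10% of randomly generated trace IDs should be kept.
ids = [secrets.token_hex(16) for _ in range(10_000)]
kept = sum(should_sample(t, 0.1) for t in ids)
print(f"kept {kept} of {len(ids)} traces")
```

Lowering `rate` trades trace coverage for less collection overhead and storage.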
Real World Examples
Uber
Uber uses Jaeger to trace billions of requests daily across its microservices, enabling engineers to pinpoint latency issues in their complex ride-hailing platform.
Netflix
Netflix employs distributed tracing to monitor streaming requests through its microservices, helping to quickly detect and resolve performance bottlenecks.
Airbnb
Airbnb uses Zipkin to trace user booking flows across services, improving debugging and reliability of their platform.
Code Example
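The before code shows a simple Flask service without any tracing. The after code adds OpenTelemetry tracing: FlaskInstrumentor automatically instruments the Flask app, and the handler manually creates a span for its own work. For simplicity, the example prints spans to the console with ConsoleSpanExporter; in a real deployment you would replace it with an exporter that sends spans to a backend like Jaeger or Zipkin.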
# Before: No tracing
from flask import Flask

app = Flask(__name__)

@app.route('/serviceA')
def service_a():
    # Process request
    return 'Service A response'


# After: With distributed tracing using OpenTelemetry
from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

app = Flask(__name__)

# Set up the tracer provider; spans are batched and printed to the console.
# Swap ConsoleSpanExporter for a backend exporter to send spans to Jaeger or Zipkin.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Instrument the Flask app so each incoming request gets a span automatically
FlaskInstrumentor().instrument_app(app)

@app.route('/serviceA')
def service_a():
    # Manually create a child span around the handler's own work
    with tracer.start_as_current_span('serviceA-handler'):
        # Process request
        return 'Service A response'
Alternatives
Logging-based tracing
Collects logs from each service and tries to correlate them after the fact, rather than propagating trace context in real time.
Use when: you cannot modify services to add tracing SDKs or need a simpler, less intrusive solution.
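As a rough illustration of log-based correlation, the sketch below stamps every log line with a request ID using only the Python standard library, then filters the collected output for one request. The names `request_id_var`, `RequestIdFilter`, and `handle`, and the `req-123` style IDs, are hypothetical; how the ID reaches each service is left out.

```python
import contextvars
import io
import logging

# Holds the ID of the request currently being handled (hypothetical name).
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


stream = io.StringIO()  # stands in for a log file or log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(request_id)s %(name)s %(message)s"))
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("service-a")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle(request_id: str) -> None:
    token = request_id_var.set(request_id)
    try:
        logger.info("looking up order")
        logger.info("order found")
    finally:
        request_id_var.reset(token)


handle("req-123")
handle("req-456")

# After-the-fact correlation: pull out every line belonging to one request.
lines = [l for l in stream.getvalue().splitlines() if l.startswith("req-123 ")]
print(lines)
```

Unlike a trace, the result is a flat list of lines: there is no parent/child structure or per-step timing unless you log it yourself.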
Metrics-based monitoring
Uses aggregated metrics like request counts and latencies without detailed request path information.
Use when: you only need high-level health and performance indicators, not detailed request flows.
Summary
Distributed tracing tracks requests across multiple microservices to find performance issues and errors.
It works by assigning unique IDs and collecting timing data sent to a central system like Jaeger or Zipkin.
This pattern is essential for debugging complex microservice architectures at scale.

Practice

1. What is the main purpose of distributed tracing tools like Jaeger or Zipkin in microservices?
easy
A. To track and visualize requests as they flow through multiple services
B. To store large amounts of user data securely
C. To replace load balancers in service communication
D. To encrypt network traffic between microservices

Solution

  1. Step 1: Understand the role of distributed tracing

    Distributed tracing tools help monitor how requests move through different microservices by collecting timing and metadata.
  2. Step 2: Identify the main function of Jaeger and Zipkin

    They visualize and analyze traces made of spans to find bottlenecks or errors in service chains.
  3. Final Answer:

    To track and visualize requests as they flow through multiple services -> Option A
  4. Quick Check:

    Distributed tracing = tracking request flow [OK]
Hint: Distributed tracing = tracking requests across services [OK]
Common Mistakes:
  • Confusing tracing with data storage
  • Thinking tracing replaces load balancers
  • Assuming tracing encrypts traffic
2. Which of the following is the correct way to propagate trace context between microservices using HTTP headers?
easy
A. Add Cookie header with span ID
B. Add Authorization header with trace ID
C. Add X-B3-TraceId and X-B3-SpanId headers to the outgoing request
D. Add Content-Type header with trace ID value

Solution

  1. Step 1: Recall standard trace context headers

    Zipkin's B3 propagation format uses specific headers like X-B3-TraceId and X-B3-SpanId to pass trace info between services. (The W3C Trace Context standard uses a traceparent header for the same purpose.)
  2. Step 2: Identify correct header usage

    Headers like Authorization, Content-Type, or Cookie are unrelated to tracing context propagation.
  3. Final Answer:

    Add X-B3-TraceId and X-B3-SpanId headers to the outgoing request -> Option C
  4. Quick Check:

    Trace context headers = X-B3-TraceId, X-B3-SpanId [OK]
Hint: Trace context uses X-B3 headers, not auth or content-type [OK]
Common Mistakes:
  • Using unrelated HTTP headers for trace context
  • Forgetting to propagate span ID
  • Confusing trace ID with authentication tokens
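The header propagation described in this question can be sketched with plain dictionaries. This is a simplified illustration: the helper names `start_trace`, `extract`, and `inject` are hypothetical, header lookup is case-sensitive here (real HTTP headers are not), and the full B3 format also defines X-B3-ParentSpanId and X-B3-Sampled, which are omitted.

```python
import secrets


def start_trace() -> dict:
    """Create context for a brand-new request entering the system."""
    return {
        "trace_id": secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_id": None,
    }


def extract(headers: dict) -> dict:
    """Read trace context from incoming headers, or start a new trace."""
    if "X-B3-TraceId" not in headers:
        return start_trace()
    return {
        "trace_id": headers["X-B3-TraceId"],      # keep the caller's trace
        "span_id": secrets.token_hex(8),          # new span for this service
        "parent_id": headers.get("X-B3-SpanId"),  # link back to the caller
    }


def inject(ctx: dict) -> dict:
    """Build the tracing headers to attach to an outgoing request."""
    return {"X-B3-TraceId": ctx["trace_id"], "X-B3-SpanId": ctx["span_id"]}


# Service A starts a trace and calls Service B with the headers attached.
ctx_a = start_trace()
outgoing = inject(ctx_a)
ctx_b = extract(outgoing)
```

If Service A forgot to call `inject`, Service B would see no tracing headers and start an unrelated trace, which is how spans end up disconnected.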
3. Given the following trace spans collected by Zipkin, what is the total time taken by the root request?
Span A (root): start=0ms, duration=50ms
Span B (child of A): start=10ms, duration=20ms
Span C (child of A): start=35ms, duration=10ms
medium
A. 50ms
B. 40ms
C. 30ms
D. 60ms

Solution

  1. Step 1: Understand root span duration

    The root span duration represents the total time of the entire request, including child spans.
  2. Step 2: Analyze given spans

    Span A starts at 0ms and lasts 50ms, so total time is 50ms regardless of child spans.
  3. Final Answer:

    50ms -> Option A
  4. Quick Check:

    Root span duration = total request time = 50ms [OK]
Hint: Root span duration = total request time [OK]
Common Mistakes:
  • Adding child span durations on top of the root span's duration
  • Ignoring root span duration
  • Confusing start times with total duration
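The arithmetic in this question can be checked with a few lines of code. The dictionary layout is a hypothetical representation of the three spans, not Zipkin's actual span format.

```python
# The three spans from the question, with times in milliseconds.
spans = {
    "A": {"parent": None, "start": 0, "duration": 50},
    "B": {"parent": "A", "start": 10, "duration": 20},
    "C": {"parent": "A", "start": 35, "duration": 10},
}

# The root span is the one without a parent; its duration is the total.
root = next(s for s in spans.values() if s["parent"] is None)
total_ms = root["duration"]

# Time the root spent outside its direct children. Subtracting the sum is
# valid here only because B (10-30ms) and C (35-45ms) do not overlap.
children_ms = sum(s["duration"] for s in spans.values() if s["parent"] == "A")
root_self_ms = total_ms - children_ms

print(total_ms, root_self_ms)  # 50 20
```

The child spans explain where 30ms of the 50ms went; they do not add to the total.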
4. You notice that your distributed tracing data in Jaeger shows many missing spans for some services. What is the most likely cause?
medium
A. The network latency is too low
B. The services have too many CPU cores
C. The database is down
D. The services are not propagating the trace context headers correctly

Solution

  1. Step 1: Identify cause of missing spans

    If spans are missing, it usually means trace context was not passed properly between services.
  2. Step 2: Eliminate unrelated causes

    CPU cores, database status, or low network latency do not cause missing trace spans.
  3. Final Answer:

    The services are not propagating the trace context headers correctly -> Option D
  4. Quick Check:

    Missing spans = trace context not propagated [OK]
Hint: Missing spans? Check trace context propagation [OK]
Common Mistakes:
  • Blaming unrelated system resources
  • Ignoring header propagation
  • Assuming network latency causes missing spans
5. You want to design a distributed tracing system for a microservices architecture with 100 services and high request volume. Which approach best ensures scalability and minimal overhead?
hard
A. Trace every request fully and store all spans in a single central database
B. Use sampling to trace only a subset of requests and propagate trace context with lightweight headers
C. Disable trace context propagation and log spans locally in each service
D. Use synchronous calls to the tracing backend for every span creation

Solution

  1. Step 1: Consider scalability needs

    Tracing every request fully in a large system causes high overhead and storage issues.
  2. Step 2: Identify best practice for high volume tracing

    Sampling reduces load by tracing only some requests, and lightweight headers keep propagation efficient.
  3. Step 3: Eliminate poor options

    Disabling propagation loses trace linkage; synchronous calls add latency; central DB can bottleneck.
  4. Final Answer:

    Use sampling to trace only a subset of requests and propagate trace context with lightweight headers -> Option B
  5. Quick Check:

    Sampling + lightweight headers = scalable tracing [OK]
Hint: Sampling + lightweight headers = scalable tracing [OK]
Common Mistakes:
  • Tracing all requests causing overhead
  • Ignoring trace context propagation
  • Using synchronous calls causing latency
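To illustrate why a synchronous backend call per span is avoided, here is a minimal sketch of batching spans before sending them. `BatchingExporter` and its `send` callback are hypothetical; real SDK processors typically also flush on a timer from a background thread, which this sketch leaves out.

```python
from typing import Callable, List


class BatchingExporter:
    """Buffer finished spans and hand them to `send` in batches."""

    def __init__(self, send: Callable[[List[object]], None], max_batch: int = 3):
        self._send = send
        self._max_batch = max_batch
        self._buffer: List[object] = []

    def export(self, span: object) -> None:
        # Cheap for the request path: usually just a list append.
        self._buffer.append(span)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        # One backend call covers many spans instead of one call per span.
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()


sent_batches = []
exporter = BatchingExporter(sent_batches.append, max_batch=3)
for span_number in range(7):
    exporter.export(span_number)
exporter.flush()  # send whatever is left, e.g. at shutdown

print(sent_batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Seven spans result in three backend calls rather than seven, and combining this with sampling reduces the volume further.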