Distributed tracing (Jaeger, Zipkin) in Microservices - System Design Guide

Problem Statement

When a user request passes through many microservices, it becomes nearly impossible to track where delays or errors occur. Without a clear way to follow the request path, debugging slow responses or failures turns into guesswork, increasing downtime and reducing reliability.

Solution

Distributed tracing assigns a unique identifier (a trace ID) to each user request and carries it along as the request flows through the microservices. Each service records timing and metadata for its part of the work as a span and sends it to a central system like Jaeger or Zipkin. The backend assembles the spans into a complete, visual timeline of the request journey, making it easy to spot bottlenecks and errors.
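To make the idea concrete, here is a minimal, standard-library-only sketch of the data a tracer records. The names `Span`, `InMemoryCollector`, and `traced` are hypothetical stand-ins for a real tracing SDK and backend; they are not part of Jaeger, Zipkin, or OpenTelemetry.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None for the root span
    service: str
    name: str
    start: float = 0.0
    end: float = 0.0
    tags: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


class InMemoryCollector:
    """Hypothetical stand-in for a tracing backend: it only stores spans."""

    def __init__(self):
        self.spans = []

    def report(self, span):
        self.spans.append(span)

    def timeline(self, trace_id):
        # All spans of one request, ordered by start time.
        spans = [s for s in self.spans if s.trace_id == trace_id]
        return sorted(spans, key=lambda s: s.start)


@contextmanager
def traced(collector, service, name, trace_id=None, parent_id=None):
    """Time a block of work and report it as a span."""
    span = Span(
        trace_id=trace_id or uuid.uuid4().hex,  # new trace if none was passed in
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent_id,
        service=service,
        name=name,
    )
    span.start = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        span.tags["error"] = repr(exc)  # record the failure on the span
        raise
    finally:
        span.end = time.perf_counter()
        collector.report(span)


collector = InMemoryCollector()

# Service A handles the request and calls service B with the same trace ID.
with traced(collector, "service-a", "GET /checkout") as root:
    with traced(collector, "service-b", "charge-card",
                trace_id=root.trace_id, parent_id=root.span_id) as child:
        time.sleep(0.01)  # simulated work

for s in collector.timeline(root.trace_id):
    indent = "  " if s.parent_id else ""
    print(f"{indent}{s.service}:{s.name} took {s.duration_ms:.1f} ms")
```

Because both spans share one trace ID and the child points at its parent, the collector can rebuild the request timeline even though the spans were reported separately.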

Architecture

Client → Service A → Tracer SDK → Distributed Tracing Backend

This diagram shows a user request flowing through multiple microservices, each instrumented with tracing SDKs that send timing and metadata to a central tracing backend like Jaeger or Zipkin.
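The diagram does not show how the trace identifier travels from one service to the next. One common approach, assumed here, is the W3C Trace Context `traceparent` HTTP header, which has the form `version-traceid-parentid-flags`. The functions `inject`, `extract`, `service_a`, and `service_b` below are illustrative names only, and the sketch skips parts of the specification (for example, rejecting all-zero IDs) that a real SDK would handle.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    return secrets.token_hex(8)  # 16 hex characters


def inject(headers, trace_id, span_id, sampled=True):
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Read trace context from incoming headers; None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def service_b(headers):
    """Downstream service: continue the caller's trace if one was sent."""
    ctx = extract(headers)
    if ctx is None:
        # No usable context, so this request starts a new trace.
        ctx = {"trace_id": secrets.token_hex(16), "parent_id": None}
    return {
        "service": "service-b",
        "trace_id": ctx["trace_id"],
        "parent_id": ctx["parent_id"],
        "span_id": new_span_id(),
    }


def service_a():
    """Entry service: start a trace and pass its context downstream."""
    trace_id = secrets.token_hex(16)
    span_id = new_span_id()
    outgoing_headers = inject({}, trace_id, span_id)
    downstream_span = service_b(outgoing_headers)  # stands in for an HTTP call
    own_span = {"service": "service-a", "trace_id": trace_id, "span_id": span_id}
    return own_span, downstream_span


span_a, span_b = service_a()
print(span_b["trace_id"] == span_a["trace_id"])  # same trace
print(span_b["parent_id"] == span_a["span_id"])  # B's span is a child of A's
```

In practice the tracing SDK performs the inject and extract steps for you when it instruments an HTTP client or server.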

Trade-offs

✓ Pros

→ Provides end-to-end visibility of requests across microservices.
→ Helps quickly identify performance bottlenecks and error sources.
→ Supports root cause analysis by showing detailed timing and metadata.
→ Integrates with existing microservices with minimal code changes using tracing SDKs.

✗ Cons

→ Adds overhead to each request due to tracing data collection and transmission.
→ Requires careful management of trace data storage to avoid high costs.
→ Complexity increases with many services and high request volumes, needing sampling strategies.

Use when your system has multiple microservices handling user requests and you need to diagnose latency or errors across service boundaries, especially at thousands of requests per second or more.

Avoid if your system is a simple monolith or has very low traffic (below a few hundred requests per second), where the overhead and complexity of distributed tracing outweigh the benefits.
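The sampling strategies mentioned in the cons can be as simple as keeping a fixed fraction of traces. The sketch below shows one possible approach: a deterministic decision derived from the trace ID, so that every service reaches the same keep-or-drop verdict for a given request without coordinating. The function name `should_sample` and the 10% rate are assumptions for illustration, not the API of any particular tracing library.

```python
import secrets


def should_sample(trace_id: str, rate: float) -> bool:
    """Keep roughly `rate` of all traces, deciding from the trace ID alone."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Treat the low 64 bits of the hex trace ID as a uniformly random number
    # and keep the trace if it falls below the threshold.
    bucket = int(trace_id[-16:], 16)
    return bucket < rate * 2**64


# Simulate 10,000 requests with random 128-bit trace IDs.
trace_ids = [secrets.token_hex(16) for _ in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, a trace is either recorded by every service it touches or by none, which avoids storing partial traces.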

Real World Examples

Uber

Uber uses Jaeger to trace billions of requests daily across its microservices, enabling engineers to pinpoint latency issues in their complex ride-hailing platform.

Netflix

Netflix employs distributed tracing to monitor streaming requests through its microservices, helping to quickly detect and resolve performance bottlenecks.

Airbnb

Airbnb uses Zipkin to trace user booking flows across services, improving debugging and reliability of their platform.

Code Example

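The first version below is a simple Flask service without any tracing. The second adds OpenTelemetry tracing: FlaskInstrumentor automatically creates a span for each incoming request, and the handler manually creates a child span around its own work. To keep the example self-contained, finished spans are printed to the console with ConsoleSpanExporter; to send them to a backend like Jaeger or Zipkin, swap in an exporter for that backend.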

### Before: No tracing
from flask import Flask

app = Flask(__name__)


@app.route('/serviceA')
def service_a():
    # Process request
    return 'Service A response'


### After: With distributed tracing using OpenTelemetry
from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

app = Flask(__name__)

# Set up the tracer provider; finished spans are batched and printed to the console
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Instrument the Flask app so every incoming request gets a span automatically
FlaskInstrumentor().instrument_app(app)


@app.route('/serviceA')
def service_a():
    # Manual child span around the handler's own work
    with tracer.start_as_current_span('serviceA-handler'):
        # Process request
        return 'Service A response'


Alternatives

Logging-based tracing

Collects logs from each service and correlates them after the fact, instead of propagating trace context with each request as it happens.

Use when: you cannot modify services to add tracing SDKs or need a simpler, less intrusive solution.
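As a rough illustration of after-the-fact correlation, the sketch below groups structured log lines by a shared request ID. It assumes each service already logs JSON with `ts`, `service`, `request_id`, and `msg` fields; those field names, the sample log lines, and the `correlate` function are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical log lines, collected from three services in arrival order.
raw_logs = [
    '{"ts": 3.0, "service": "payments", "request_id": "req-1", "msg": "card charged"}',
    '{"ts": 1.0, "service": "gateway", "request_id": "req-1", "msg": "request received"}',
    '{"ts": 1.5, "service": "gateway", "request_id": "req-2", "msg": "request received"}',
    '{"ts": 2.0, "service": "orders", "request_id": "req-1", "msg": "order created"}',
]


def correlate(lines):
    """Group log entries by request ID and order each group by timestamp."""
    by_request = defaultdict(list)
    for line in lines:
        entry = json.loads(line)
        by_request[entry["request_id"]].append(entry)
    for entries in by_request.values():
        entries.sort(key=lambda e: e["ts"])
    return dict(by_request)


requests = correlate(raw_logs)
for entry in requests["req-1"]:
    print(f'{entry["ts"]:>4} {entry["service"]:<9} {entry["msg"]}')
```

Unlike a trace, this reconstruction has no parent and child links between steps, and its ordering is only as reliable as the clocks on the machines that wrote the logs.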

Metrics-based monitoring

Uses aggregated metrics like request counts and latencies without detailed request path information.

Use when: you only need high-level health and performance indicators, not detailed request flows.
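To show what aggregated metrics do and do not reveal, here is a small sketch that summarizes request latencies per endpoint. The `LatencyMetrics` class, the nearest-rank percentile method, and the sample latencies are assumptions made for illustration; a production system would typically use a metrics library with histograms instead of storing every sample.

```python
import math
from collections import defaultdict


class LatencyMetrics:
    """Hypothetical in-process aggregator for request latencies."""

    def __init__(self):
        self._samples = defaultdict(list)

    def observe(self, endpoint, latency_ms):
        self._samples[endpoint].append(latency_ms)

    def summary(self, endpoint):
        samples = sorted(self._samples.get(endpoint, []))
        if not samples:
            return {"count": 0}

        def percentile(p):
            # Nearest-rank percentile: smallest value covering p% of samples.
            rank = max(1, math.ceil(p / 100 * len(samples)))
            return samples[rank - 1]

        return {
            "count": len(samples),
            "p50": percentile(50),
            "p95": percentile(95),
            "max": samples[-1],
        }


metrics = LatencyMetrics()
# Nine fast requests and one slow outlier, in milliseconds.
for latency in [12, 15, 11, 14, 13, 16, 12, 15, 14, 480]:
    metrics.observe("/serviceA", latency)

summary = metrics.summary("/serviceA")
print(summary)
```

The summary makes the slow outlier visible in the tail latency, but it cannot say which downstream call made that one request slow; answering that question is what a trace is for.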

Summary

Distributed tracing tracks requests across multiple microservices to find performance issues and errors.

It works by assigning unique IDs and collecting timing data sent to a central system like Jaeger or Zipkin.

This pattern is essential for debugging complex microservice architectures at scale.