Distributed tracing (Jaeger, Zipkin) in Microservices - Deep Dive

Overview - Distributed tracing (Jaeger, Zipkin)
What is it?
Distributed tracing is a way to track how a single user request moves through many small services in a system. It helps see the path and time spent in each service, making it easier to find slow parts or errors. Tools like Jaeger and Zipkin collect and show this information visually. This helps teams understand complex systems made of many connected parts.
Why it matters
Without distributed tracing, it is very hard to know where delays or failures happen in a system made of many services. Teams would waste time guessing or looking at logs from each service separately. Distributed tracing gives a clear story of each request’s journey, saving time and improving reliability. This means faster fixes, better user experience, and more trust in the system.
Where it fits
Before learning distributed tracing, you should understand microservices basics and how services communicate over networks. After this, you can learn about monitoring, logging, and alerting systems that use tracing data to improve system health and performance.
Mental Model
Core Idea
Distributed tracing follows a single request as it travels through many services, recording each step to reveal the full journey and timing.
Think of it like...
Imagine tracking a package sent through multiple delivery centers. Each center scans the package and notes the time it arrived and left. Distributed tracing is like this scanning system for requests moving through services.
Request Start
   │
   ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ Service A   │───▶│ Service B   │───▶│ Service C   │
│ (Span 1)    │    │ (Span 2)    │    │ (Span 3)    │
└─────────────┘    └─────────────┘    └─────────────┘
   │                  │                  │
   ▼                  ▼                  ▼
Trace Collector (Jaeger/Zipkin) collects all spans
   │
   ▼
Trace Visualization Dashboard
Build-Up - 7 Steps
1. Foundation: What is a trace and a span
Concept: Introduce the basic units of distributed tracing: traces and spans.
A trace represents the entire journey of a request through multiple services. Each step in this journey is called a span. A span records the start time, end time, and metadata about that step. Together, spans form a trace that shows the full path and timing.
Result
You understand that a trace is made of spans, and spans represent individual operations or service calls.
Understanding traces and spans is key because they are the building blocks that let us see how requests flow and where time is spent.
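To make the two terms concrete, here is a minimal sketch of a span record and of grouping spans into traces. The field names (trace_id, span_id, parent_id, start_ms, end_ms) and the sample operation names are assumptions chosen for illustration; they are not the exact data model of Jaeger or Zipkin.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One timed operation within a trace (field names are illustrative)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int
    tags: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def group_by_trace(spans):
    """Group spans into traces keyed by their shared trace ID."""
    traces = {}
    for span in spans:
        traces.setdefault(span.trace_id, []).append(span)
    return traces


# Hypothetical spans: three belong to trace "t1", one to trace "t2".
spans = [
    Span("t1", "a", None, "GET /checkout", 0, 50),
    Span("t1", "b", "a", "inventory.reserve", 10, 30),
    Span("t1", "c", "a", "payment.charge", 35, 45),
    Span("t2", "d", None, "GET /health", 0, 2),
]
traces = group_by_trace(spans)
```

A span with no parent is the root of its trace; every other span points at the span that caused it.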
2. Foundation: How services pass trace context
Concept: Explain how trace information travels between services to link spans together.
When one service calls another, it passes trace context (like a trace ID and span ID) in the request headers. This lets the next service create a new span linked to the same trace. Without passing this context, spans would be isolated and not form a full trace.
Result
You know that trace context is passed along service calls to connect spans into a trace.
Knowing how trace context propagates is crucial to build a complete picture of the request journey across services.
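The hand-off described above can be sketched with two small helpers: one that the calling service uses to write context into outgoing headers, and one that the receiving service uses to read it. This sketch assumes Zipkin's B3 header names (X-B3-TraceId, X-B3-SpanId); the functions inject, extract, and new_id are hypothetical helpers, not part of any tracing library.

```python
import secrets


def new_id() -> str:
    """Return a random 64-bit ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def inject(trace_id: str, span_id: str, headers: dict) -> dict:
    """Caller side: copy the current trace context into outgoing headers."""
    out = dict(headers)
    out["X-B3-TraceId"] = trace_id
    out["X-B3-SpanId"] = span_id
    return out


def extract(headers: dict) -> dict:
    """Callee side: continue the caller's trace, or start a new one.

    Note: real HTTP headers are case-insensitive; this sketch uses an
    exact-match dict lookup for simplicity.
    """
    trace_id = headers.get("X-B3-TraceId")
    parent_id = headers.get("X-B3-SpanId")
    if trace_id is None:
        # No incoming context: this service is the start of a new trace.
        trace_id, parent_id = new_id(), None
    return {"trace_id": trace_id, "span_id": new_id(), "parent_id": parent_id}


# Service A injects its context; Service B extracts it and creates a child span.
outgoing = inject("463ac35c9f6413ad", "a2fb4a1d1a96d312", {"Accept": "application/json"})
ctx = extract(outgoing)
```

The new span keeps the caller's trace ID and records the caller's span ID as its parent, which is what links the two spans into one trace.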
3. Intermediate: Role of Jaeger and Zipkin collectors
🤔Before reading on: do you think Jaeger and Zipkin collect trace data from services directly or from logs? Commit to your answer.
Concept: Introduce how tracing systems collect and store trace data from services.
Jaeger and Zipkin run collectors that receive spans sent by instrumented services. These collectors store the spans in databases and provide APIs and UIs to query and visualize traces. Services send spans asynchronously to avoid slowing down requests.
Result
You understand that Jaeger and Zipkin act as central places to gather and show trace data.
Knowing the collector’s role helps you see how distributed tracing scales and stays efficient without blocking service operations.
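One way to picture asynchronous span reporting is a bounded in-memory queue drained by a background thread. The sketch below is an assumption about how such a reporter could look, not the actual Jaeger or Zipkin client code; BackgroundReporter and its send_batch callback are hypothetical names, and a real client would send each batch to the collector over the network.

```python
import queue
import threading


class BackgroundReporter:
    """Buffers finished spans and ships them from a worker thread."""

    def __init__(self, send_batch, max_queue=1000, batch_size=50):
        self._send_batch = send_batch  # hypothetical callable that delivers a list of spans
        self._queue = queue.Queue(maxsize=max_queue)
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, span) -> bool:
        """Called on the request path: never blocks. Returns False if the span was dropped."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            return False  # buffer full: drop the span rather than slow the request

    def _run(self) -> None:
        while True:
            batch = [self._queue.get()]  # wait for at least one span
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            try:
                self._send_batch(batch)
            except Exception:
                pass  # delivery failed; keep the worker alive
            finally:
                for _ in batch:
                    self._queue.task_done()

    def flush(self) -> None:
        """Block until every buffered span has been handed to send_batch."""
        self._queue.join()


# Usage: a plain list stands in for the collector.
received = []
reporter = BackgroundReporter(send_batch=received.extend)
for i in range(120):
    reporter.report({"span_id": i})
reporter.flush()
```

The design choice to drop spans when the buffer is full reflects the priority stated above: losing some trace data is preferable to slowing down user requests.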
4. Intermediate: Instrumentation: automatic vs manual tracing
🤔Before reading on: do you think tracing requires writing code in every service or can it be automatic? Commit to your answer.
Concept: Explain how services get instrumented to produce trace data.
Instrumentation means adding code to create spans. It can be manual, where developers add tracing calls, or automatic, where libraries or frameworks add tracing without code changes. Automatic instrumentation speeds up adoption and reduces errors. Manual instrumentation takes more effort, but it can record business-specific operations that a library cannot see.
Result
You see the tradeoffs between manual and automatic instrumentation for tracing.
Understanding instrumentation methods helps you plan how to add tracing to existing or new services efficiently.
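The difference between the two styles can be sketched in a few lines. Here start_span is a hand-written helper that a developer calls explicitly (manual), and traced is a wrapper that adds a span around any function it decorates, which is roughly what automatic instrumentation libraries do for framework and client calls. Both names, and the finished_spans list that stands in for a reporter, are assumptions for illustration.

```python
import functools
import time
from contextlib import contextmanager

finished_spans = []  # stand-in for a real reporter


@contextmanager
def start_span(name: str):
    """Manual instrumentation: the developer wraps a block of code explicitly."""
    span = {"name": name, "start": time.monotonic()}
    try:
        yield span
    except Exception as exc:
        span["error"] = repr(exc)  # record the failure, then let it propagate
        raise
    finally:
        span["end"] = time.monotonic()
        finished_spans.append(span)


def traced(func):
    """Wrapper-style instrumentation: a span is created around every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with start_span(func.__name__):
            return func(*args, **kwargs)
    return wrapper


@traced
def load_profile(user_id: int) -> dict:
    return {"id": user_id}


# The inner span finishes first, so it is recorded before the outer one.
with start_span("checkout"):
    load_profile(42)
```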
5. Intermediate: Trace sampling and data volume control
🤔Before reading on: do you think tracing every request is always practical? Commit to your answer.
Concept: Introduce sampling to limit the amount of trace data collected.
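Tracing every request can produce huge data volumes and add overhead to each service. Sampling means tracing only a portion of requests, such as 1 in 100. This reduces overhead while still giving useful insights. Common strategies are probabilistic (keep a fixed percentage of traces), rate limiting (keep at most a set number of traces per second), and adaptive (adjust the rate automatically based on traffic).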
Result
You understand why and how sampling controls tracing data volume.
Knowing sampling prevents performance problems and storage overload in large systems.
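A probabilistic sampler can be sketched as a pure function of the trace ID, so that every service reaches the same keep-or-drop decision for the same trace. This is a minimal sketch assuming hex-encoded, randomly generated trace IDs; should_sample is a hypothetical helper, and real tracers typically also carry the decision along with the propagated context.

```python
def should_sample(trace_id: str, rate: float) -> bool:
    """Keep roughly `rate` of all traces, deciding from the trace ID alone."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Treat the low 64 bits of the hex trace ID as a number in [0, 2**64).
    value = int(trace_id[-16:], 16)
    return value < rate * 2**64


# With rate=0.01, about 1 in 100 randomly generated trace IDs is kept.
keep_low = should_sample("00000000000000ff", 0.01)   # small value: kept
keep_high = should_sample("ffffffffffffffff", 0.01)  # large value: dropped
```

Because the decision depends only on the trace ID, a trace is either recorded in full by every service or not recorded at all, which avoids half-sampled traces.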
6. Advanced: Distributed tracing in production systems
🤔Before reading on: do you think tracing data alone is enough to diagnose all issues? Commit to your answer.
Concept: Explore how tracing integrates with monitoring and alerting in real systems.
In production, tracing is combined with logs and metrics to get full observability. Traces help find slow or failing requests, then logs provide details, and metrics show trends. Tracing data is used in dashboards and alerts to quickly detect and fix problems.
Result
You see how tracing fits into a larger observability strategy.
Understanding tracing’s role in observability helps you design systems that are easier to maintain and troubleshoot.
7. Expert: Challenges and pitfalls of distributed tracing
🤔Before reading on: do you think distributed tracing always shows a perfect picture of request flow? Commit to your answer.
Concept: Discuss common challenges like clock skew, missing spans, and overhead.
Distributed tracing faces issues such as clock differences between hosts that make span timings inconsistent, incomplete traces when some spans are missing, and performance overhead when instrumentation is too detailed. Experts address these with clock synchronization (for example NTP), careful instrumentation, and adaptive sampling.
Result
You understand the hidden complexities and how experts mitigate them.
Knowing these challenges prepares you to build reliable tracing systems and avoid common mistakes.
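Clock skew often shows up as a child span that appears to start before the parent that called it. The sketch below flags such spans as suspects; it is a simple heuristic under assumed field names (span_id, parent_id, start_ms), not a correction algorithm used by any particular tool.

```python
def find_skew_suspects(spans):
    """Return IDs of spans that appear to start before their parent started."""
    by_id = {s["span_id"]: s for s in spans}
    suspects = []
    for s in spans:
        parent = by_id.get(s.get("parent_id"))
        if parent is None:
            continue  # root span, or the parent span is missing from this trace
        if s["start_ms"] < parent["start_ms"]:
            suspects.append(s["span_id"])
    return suspects


# Hypothetical trace: span "c" ran on a host whose clock is behind.
trace = [
    {"span_id": "a", "parent_id": None, "start_ms": 1000, "end_ms": 1050},
    {"span_id": "b", "parent_id": "a", "start_ms": 1010, "end_ms": 1030},
    {"span_id": "c", "parent_id": "a", "start_ms": 990, "end_ms": 1000},
    {"span_id": "d", "parent_id": "zz", "start_ms": 0, "end_ms": 5},
]
suspects = find_skew_suspects(trace)
```

Span "d" is skipped rather than flagged because its parent is missing, which is the separate "incomplete trace" problem described above.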
Under the Hood
Distributed tracing works by assigning a unique trace ID to each request. Each service creates spans with start and end timestamps and metadata. Trace context is passed via request headers to link spans. Spans are sent asynchronously to a collector that stores them in a database. The collector indexes spans by trace ID and timestamps, enabling fast queries and visualization.
Why designed this way?
This design balances detailed visibility with low overhead. Passing trace context in headers keeps services loosely coupled. Asynchronous span reporting avoids slowing requests. Central collectors enable scalable storage and querying. Centralized logging alone shows what each service did, but without a shared trace ID and parent links it cannot reliably reconstruct the path of a single request across services.
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Service A     │────▶│ Service B     │────▶│ Service C     │
│ Span A1       │     │ Span B1       │     │ Span C1       │
│ Trace ID: 123 │     │ Trace ID: 123 │     │ Trace ID: 123 │
└──────┬────────┘     └──────┬────────┘     └──────┬────────┘
       │                     │                     │
       ▼                     ▼                     ▼
  ┌───────────────────────────────────────────────┐
  │           Trace Collector (Jaeger/Zipkin)      │
  │ Stores spans, indexes by trace ID and time     │
  └───────────────────────────────────────────────┘
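The storage and indexing step in the diagram can be sketched as a toy in-memory collector. It assumes spans arrive as dictionaries with trace_id, span_id, parent_id, name, start_ms, and end_ms keys; InMemoryCollector is a hypothetical class, and real collectors write to a database such as Elasticsearch or Cassandra instead of a Python dict.

```python
from collections import defaultdict


class InMemoryCollector:
    """Toy collector: stores spans and indexes them by trace ID."""

    def __init__(self):
        self._by_trace = defaultdict(list)

    def submit(self, span: dict) -> None:
        self._by_trace[span["trace_id"]].append(span)

    def get_trace(self, trace_id: str) -> list:
        """Return the spans of one trace, ordered by start time."""
        return sorted(self._by_trace.get(trace_id, []), key=lambda s: s["start_ms"])

    def render(self, trace_id: str) -> list:
        """Return indented lines showing the parent/child structure of a trace."""
        children = defaultdict(list)
        for s in self.get_trace(trace_id):
            children[s["parent_id"]].append(s)
        lines = []

        def walk(parent_id, depth):
            for s in children.get(parent_id, []):
                lines.append(f"{'  ' * depth}{s['name']} ({s['end_ms'] - s['start_ms']}ms)")
                walk(s["span_id"], depth + 1)

        walk(None, 0)  # root spans have no parent
        return lines


# Spans may arrive in any order; the collector reassembles the tree.
collector = InMemoryCollector()
collector.submit({"trace_id": "123", "span_id": "c1", "parent_id": "b1",
                  "name": "Service C", "start_ms": 15, "end_ms": 30})
collector.submit({"trace_id": "123", "span_id": "a1", "parent_id": None,
                  "name": "Service A", "start_ms": 0, "end_ms": 50})
collector.submit({"trace_id": "123", "span_id": "b1", "parent_id": "a1",
                  "name": "Service B", "start_ms": 10, "end_ms": 40})
tree = collector.render("123")
```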
Myth Busters - 4 Common Misconceptions
Quick: Does distributed tracing automatically fix all performance problems? Commit yes or no.
Common Belief: Distributed tracing automatically solves all performance and debugging issues once installed.
Reality: Distributed tracing only provides data; teams must analyze and act on it. It does not fix problems by itself.
Why it matters: Believing tracing is a magic fix leads to ignoring root causes and wasting time expecting instant results.
Quick: Is it safe to trace every single request in a high-traffic system? Commit yes or no.
Common Belief: Tracing every request is always best to get complete data.
Reality: Tracing every request can overwhelm storage and slow systems. Sampling is needed to balance data and performance.
Why it matters: Ignoring sampling causes system slowdowns and high costs, making tracing unusable in production.
Quick: Does distributed tracing require all services to be written in the same language? Commit yes or no.
Common Belief: All services must use the same programming language or framework for tracing to work.
Reality: Tracing works across different languages and platforms as long as every service uses the same context propagation format (for example B3 or W3C Trace Context) and passes the context on correctly.
Why it matters: Thinking tracing is language-locked limits adoption in diverse environments and causes unnecessary rewrites.
Quick: Does distributed tracing always show exact timings without error? Commit yes or no.
Common Belief: Distributed tracing timings are perfectly accurate and synchronized across services.
Reality: Clock skew and network delays can cause timing inaccuracies; experts use synchronization and correction techniques.
Why it matters: Assuming perfect accuracy can mislead troubleshooting and cause wrong conclusions about bottlenecks.
Expert Zone
1. Trace context propagation can be lost in asynchronous messaging or retries, requiring special handling.
2. Sampling strategies can be dynamic, adjusting rates based on traffic or error rates to optimize data usefulness.
3. Span tags and logs enrich traces but must be balanced to avoid excessive data and privacy issues.
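For the first point, one common form of special handling is to carry the trace context inside the message itself, since a queue has no HTTP headers to rely on. The sketch below assumes a JSON envelope with "trace" and "payload" keys; the envelope layout and the wrap_message and unwrap_message helpers are hypothetical, and many brokers offer their own message header or attribute fields that serve the same purpose.

```python
import json


def wrap_message(payload: dict, trace_id: str, span_id: str) -> str:
    """Producer side: carry trace context inside the message, next to the payload."""
    envelope = {"trace": {"trace_id": trace_id, "span_id": span_id}, "payload": payload}
    return json.dumps(envelope)


def unwrap_message(raw: str):
    """Consumer side: return the payload and the trace context, if any."""
    envelope = json.loads(raw)
    ctx = envelope.get("trace")  # None if the producer was not instrumented
    return envelope["payload"], ctx


# The consumer may run seconds or hours later and still join the same trace.
raw = wrap_message({"order_id": 7}, "463ac35c9f6413ad", "a2fb4a1d1a96d312")
payload, ctx = unwrap_message(raw)
```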
When NOT to use
Distributed tracing is less useful for simple monolithic applications or batch jobs where request flow is linear and easy to follow. In such cases, traditional logging and profiling tools are better alternatives.
Production Patterns
In production, teams use tracing integrated with alerting systems to detect slow requests automatically. They combine tracing with metrics dashboards and centralized logging to get full observability. Traces are often sampled and stored in scalable backends like Elasticsearch or Cassandra.
Connections
Observability
Distributed tracing is a core pillar alongside logging and metrics in observability.
Understanding tracing helps grasp how observability provides a complete picture of system health and behavior.
Network Packet Tracing
Both trace the path of data through a network, but distributed tracing focuses on application-level requests.
Knowing network tracing concepts clarifies how distributed tracing adds context and timing at the software level.
Supply Chain Tracking
Both track items moving through multiple steps and locations to ensure visibility and accountability.
Seeing distributed tracing as supply chain tracking highlights the importance of context propagation and timing in complex flows.
Common Pitfalls
#1 Not passing trace context between services, causing broken traces.
Wrong approach: Service A calls Service B without adding trace headers: httpClient.get('serviceB/api/data')
Correct approach: Service A passes trace context headers: httpClient.get('serviceB/api/data', { headers: { 'X-B3-TraceId': currentTraceId, 'X-B3-SpanId': currentSpanId } })
Root cause: Not realizing that trace context must be propagated on every outgoing call to link spans.
#2 Tracing every request without sampling, causing performance issues.
Wrong approach: Always create and send spans for every request regardless of load.
Correct approach: Implement sampling logic to trace only a subset of requests, e.g., 1 in 100.
Root cause: Not realizing the cost and overhead of tracing all requests in high-traffic systems.
#3 Relying on ad hoc manual instrumentation, leading to inconsistent traces.
Wrong approach: Developers add tracing code only in some services or some methods inconsistently.
Correct approach: Use automatic instrumentation libraries or frameworks to ensure consistent tracing coverage.
Root cause: Underestimating the effort and errors in manual instrumentation.
Key Takeaways
Distributed tracing tracks a request’s path through many services by linking spans with trace context.
Passing trace context between services is essential to build a complete trace.
Tools like Jaeger and Zipkin collect, store, and visualize trace data to help diagnose issues.
Sampling is necessary to control tracing data volume and avoid system slowdowns.
Distributed tracing works best combined with logs and metrics for full observability.

Practice

1. What is the main purpose of distributed tracing tools like Jaeger or Zipkin in microservices?
easy
A. To track and visualize requests as they flow through multiple services
B. To store large amounts of user data securely
C. To replace load balancers in service communication
D. To encrypt network traffic between microservices

Solution

  1. Step 1: Understand the role of distributed tracing

    Distributed tracing tools help monitor how requests move through different microservices by collecting timing and metadata.
  2. Step 2: Identify the main function of Jaeger and Zipkin

    They visualize and analyze traces made of spans to find bottlenecks or errors in service chains.
  3. Final Answer:

    To track and visualize requests as they flow through multiple services -> Option A
  4. Quick Check:

    Distributed tracing = track requests flow [OK]
Hint: Distributed tracing = tracking requests across services [OK]
Common Mistakes:
  • Confusing tracing with data storage
  • Thinking tracing replaces load balancers
  • Assuming tracing encrypts traffic
2. Which of the following is the correct way to propagate trace context between microservices using HTTP headers?
easy
A. Add Cookie header with span ID
B. Add Authorization header with trace ID
C. Add X-B3-TraceId and X-B3-SpanId headers to the outgoing request
D. Add Content-Type header with trace ID value

Solution

  1. Step 1: Recall standard trace context headers

    Zipkin's B3 propagation format uses dedicated headers such as X-B3-TraceId and X-B3-SpanId to pass trace info between services.
  2. Step 2: Identify correct header usage

    Headers like Authorization, Content-Type, or Cookie are unrelated to tracing context propagation.
  3. Final Answer:

    Add X-B3-TraceId and X-B3-SpanId headers to the outgoing request -> Option C
  4. Quick Check:

    Trace context headers = X-B3-TraceId, X-B3-SpanId [OK]
Hint: Trace context uses X-B3 headers, not auth or content-type [OK]
Common Mistakes:
  • Using unrelated HTTP headers for trace context
  • Forgetting to propagate span ID
  • Confusing trace ID with authentication tokens
3. Given the following trace spans collected by Zipkin, what is the total time taken by the root request?
Span A (root): start=0ms, duration=50ms
Span B (child of A): start=10ms, duration=20ms
Span C (child of A): start=35ms, duration=10ms
medium
A. 50ms
B. 40ms
C. 30ms
D. 60ms

Solution

  1. Step 1: Understand root span duration

    The root span duration represents the total time of the entire request, including child spans.
  2. Step 2: Analyze given spans

    Span A starts at 0ms and lasts 50ms, so total time is 50ms regardless of child spans.
  3. Final Answer:

    50ms -> Option A
  4. Quick Check:

    Root span duration = total request time = 50ms [OK]
Hint: Root span duration = total request time [OK]
Common Mistakes:
  • Adding child spans durations incorrectly
  • Ignoring root span duration
  • Confusing start times with total duration
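The arithmetic behind this question can be checked with a short sketch. It uses the three spans from the question; the dictionary layout is an assumption for illustration. It also computes the root span's self time, meaning the part of the 50ms not covered by its children.

```python
# Spans from the question (times in milliseconds).
spans = {
    "A": {"parent": None, "start": 0, "duration": 50},
    "B": {"parent": "A", "start": 10, "duration": 20},
    "C": {"parent": "A", "start": 35, "duration": 10},
}

root = next(s for s in spans.values() if s["parent"] is None)
total_ms = root["duration"]  # total request time is the root span's duration

children_ms = sum(s["duration"] for s in spans.values() if s["parent"] == "A")
# Subtracting is valid here because B (10 to 30) and C (35 to 45) do not overlap.
self_ms = total_ms - children_ms
```

Here total_ms is 50, children_ms is 30, and self_ms is 20: Service A spent 20ms on its own work and 30ms waiting on B and C.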
4. You notice that your distributed tracing data in Jaeger shows many missing spans for some services. What is the most likely cause?
medium
A. The network latency is too low
B. The services have too many CPU cores
C. The database is down
D. The services are not propagating the trace context headers correctly

Solution

  1. Step 1: Identify cause of missing spans

    If spans are missing, the most common cause is that trace context was not passed properly between services, so the downstream spans are never linked to the trace.
  2. Step 2: Eliminate unrelated causes

    CPU cores, database status, or low network latency do not cause missing trace spans.
  3. Final Answer:

    The services are not propagating the trace context headers correctly -> Option D
  4. Quick Check:

    Missing spans = trace context not propagated [OK]
Hint: Missing spans? Check trace context propagation [OK]
Common Mistakes:
  • Blaming unrelated system resources
  • Ignoring header propagation
  • Assuming network latency causes missing spans
5. You want to design a distributed tracing system for a microservices architecture with 100 services and high request volume. Which approach best ensures scalability and minimal overhead?
hard
A. Trace every request fully and store all spans in a single central database
B. Use sampling to trace only a subset of requests and propagate trace context with lightweight headers
C. Disable trace context propagation and log spans locally in each service
D. Use synchronous calls to the tracing backend for every span creation

Solution

  1. Step 1: Consider scalability needs

    Tracing every request fully in a large system causes high overhead and storage issues.
  2. Step 2: Identify best practice for high volume tracing

    Sampling reduces load by tracing only some requests, and lightweight headers keep propagation efficient.
  3. Step 3: Eliminate poor options

    Disabling propagation loses trace linkage; synchronous calls add latency; central DB can bottleneck.
  4. Final Answer:

    Use sampling to trace only a subset of requests and propagate trace context with lightweight headers -> Option B
  5. Quick Check:

    Sampling + lightweight headers = scalable tracing [OK]
Hint: Sampling + lightweight headers = scalable tracing [OK]
Common Mistakes:
  • Tracing all requests causing overhead
  • Ignoring trace context propagation
  • Using synchronous calls causing latency