| Users / Requests | Trace Volume | Storage Needs | Processing Load | Visualization Complexity |
|---|---|---|---|---|
| 100 users | Low (few traces per second) | Minimal, local storage | Single Jaeger/Zipkin instance | Simple trace views |
| 10,000 users | Moderate (hundreds of traces/sec) | Increased storage, possibly remote DB | Multiple collectors, basic load balancing | More complex trace aggregation |
| 1,000,000 users | High (thousands of traces/sec) | Distributed storage (Cassandra, Elasticsearch) | Horizontal scaling of collectors and query services | Advanced UI filtering and sampling needed |
| 100,000,000 users | Very High (tens of thousands of traces/sec) | Sharded, multi-region storage clusters | Highly scalable, multi-tenant tracing infrastructure | Automated anomaly detection, AI-assisted analysis |
Distributed tracing (Jaeger, Zipkin) in Microservices - Scalability & System Analysis
The first bottleneck is the storage backend for trace data. As trace volume grows, the database that stores spans and traces becomes overwhelmed by write and read requests. This causes delays in trace ingestion and slow query responses.
- Horizontal scaling: Add more collector and query service instances behind load balancers to handle increased traffic.
- Storage optimization: Use scalable distributed databases like Cassandra or Elasticsearch with sharding and replication.
- Sampling: Reduce data volume by sampling traces (e.g., tracing only 10% of requests).
- Caching: Cache frequent query results to reduce load on storage.
- Data retention policies: Archive or delete old traces to save storage space.
- Multi-region deployment: Deploy tracing infrastructure closer to services to reduce latency and bandwidth.
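As an illustration of the sampling item in the list above, here is a minimal sketch of a deterministic, trace-ID-based sampling decision. It is not the sampler that Jaeger or Zipkin ships; the function name `should_sample` and the choice of SHA-256 as the hash are assumptions made for this example. The idea is that hashing the trace ID gives every service the same keep/drop answer for a given trace, so sampled traces stay complete.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Hypothetical helper: keep roughly `rate` of all traces.

    The decision depends only on the trace ID, so every service that
    sees the same trace ID reaches the same decision.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the trace if its bucket falls in the lowest `rate` fraction.
    return bucket < int(rate * 2**64)


# Made-up trace IDs, only to show the approximate sampled fraction.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

With a 10% rate the kept count should land near 1,000, though the exact number depends on the hash values of the IDs.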
Assuming 1 million users generating 10,000 traces per second, each trace averaging 10 spans of 1KB each:
- Trace data per second: 10,000 traces * 10 spans * 1KB = 100MB/s
- Storage per day: 100MB/s * 3600 * 24 ≈ 8.6TB/day
- Network bandwidth: 100MB/s is about 800Mbps, so links faster than 1Gbps are needed to leave headroom for ingestion
- Database QPS: Storage must handle ~100,000 writes/sec (spans)
- Collector servers: Multiple instances needed to handle ingestion load
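The arithmetic above can be reproduced with a short script, which also makes it easy to try other inputs. This is a back-of-the-envelope sketch: the function name `estimate` and the optional `sample_rate` parameter are introduced here for illustration, and 1KB is treated as 1,000 bytes to match the round numbers in the list.

```python
def estimate(traces_per_sec: float, spans_per_trace: int,
             span_bytes: int, sample_rate: float = 1.0) -> dict:
    """Rough ingestion estimate for a tracing backend (illustrative only)."""
    spans_per_sec = traces_per_sec * sample_rate * spans_per_trace
    bytes_per_sec = spans_per_sec * span_bytes
    return {
        "spans_per_sec": spans_per_sec,                # storage writes/sec
        "mb_per_sec": bytes_per_sec / 1e6,             # ingest rate
        "tb_per_day": bytes_per_sec * 86_400 / 1e12,   # 86,400 seconds/day
        "gbps": bytes_per_sec * 8 / 1e9,               # payload bandwidth
    }


# The scenario from the text: 10,000 traces/sec, 10 spans each, 1KB per span.
full = estimate(10_000, 10, 1_000)
print(full)

# The same traffic with 10% sampling applied before storage.
sampled = estimate(10_000, 10, 1_000, sample_rate=0.10)
print(sampled)
```

The unsampled case gives 100,000 span writes per second, 100MB/s, about 8.64TB per day, and 0.8Gbps of payload, which matches the figures listed above.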
Start by explaining what distributed tracing solves in microservices. Then discuss how trace data volume grows with users and requests. Identify the storage backend as the first bottleneck. Propose sampling and horizontal scaling of collectors and storage. Mention trade-offs like data retention and query latency. Finish with how to monitor and optimize the tracing system itself.
Your tracing database handles 1000 writes per second. Traffic grows 10x to 10,000 writes per second. What do you do first and why?
Answer: Implement sampling to reduce the number of traces stored, and horizontally scale the storage backend with sharding or replicas to handle increased write load. This prevents the database from becoming a bottleneck.
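To make the answer concrete, the sketch below works out how aggressive sampling would need to be if the database stayed at its current capacity. The helper `required_sample_rate` is a hypothetical name used only here; in practice the rate would be combined with scaling the storage backend rather than used alone.

```python
def required_sample_rate(capacity_writes: float, incoming_writes: float) -> float:
    """Largest sampling rate that keeps stored writes within capacity."""
    if incoming_writes <= 0:
        return 1.0  # nothing to store, so no sampling is needed
    return min(1.0, capacity_writes / incoming_writes)


# Scenario from the question: capacity 1,000 writes/sec, load 10,000 writes/sec.
rate = required_sample_rate(1_000, 10_000)
print(f"sample at most {rate:.0%} of traces")
```

For the numbers in the question this works out to keeping at most 10% of traces until the storage tier is scaled out.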
Practice
Jaeger or Zipkin in microservices?
Solution
Step 1: Understand the role of distributed tracing
Distributed tracing tools help monitor how requests move through different microservices by collecting timing and metadata.
Step 2: Identify the main function of Jaeger and Zipkin
They visualize and analyze traces made of spans to find bottlenecks or errors in service chains.
Final Answer:
To track and visualize requests as they flow through multiple services -> Option A
Quick Check:
Distributed tracing = track requests flow [OK]
- Confusing tracing with data storage
- Thinking tracing replaces load balancers
- Assuming tracing encrypts traffic
Solution
Step 1: Recall standard trace context headers
Zipkin's B3 propagation format uses headers such as X-B3-TraceId and X-B3-SpanId to pass trace context between services.
Step 2: Identify correct header usage
Headers like Authorization, Content-Type, or Cookie are unrelated to tracing context propagation.
Final Answer:
Add X-B3-TraceId and X-B3-SpanId headers to the outgoing request -> Option C
Quick Check:
Trace context headers = X-B3-TraceId, X-B3-SpanId [OK]
- Using unrelated HTTP headers for trace context
- Forgetting to propagate span ID
- Confusing trace ID with authentication tokens
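The propagation step can be sketched without any tracing library. The example below builds the B3 headers for an outgoing call from the headers of the incoming request. The function names `new_id` and `child_headers` are hypothetical, and the sketch assumes header keys arrive with exactly this casing; a real instrumentation library would handle case-insensitive lookup and span reporting as well.

```python
import secrets


def new_id(num_bytes: int = 8) -> str:
    """Random lowercase hex ID; 8 bytes gives the 16 hex chars B3 span IDs use."""
    return secrets.token_hex(num_bytes)


def child_headers(incoming: dict) -> dict:
    """Build B3 headers for an outgoing call made while handling `incoming`."""
    # Reuse the caller's trace ID, or start a new 128-bit trace if there is none.
    trace_id = incoming.get("X-B3-TraceId") or new_id(16)
    outgoing = {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": new_id(),  # fresh ID for the outgoing call's span
    }
    # The span that received the request becomes the parent of the new span.
    parent = incoming.get("X-B3-SpanId")
    if parent:
        outgoing["X-B3-ParentSpanId"] = parent
    # Pass the sampling decision along unchanged when one was made upstream.
    if "X-B3-Sampled" in incoming:
        outgoing["X-B3-Sampled"] = incoming["X-B3-Sampled"]
    return outgoing


# Example incoming headers with made-up IDs.
incoming = {"X-B3-TraceId": "a" * 32, "X-B3-SpanId": "b" * 16, "X-B3-Sampled": "1"}
print(child_headers(incoming))
```

The trace ID stays the same across every hop, while each hop gets its own span ID; that shared trace ID is what lets the backend stitch spans into one trace.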
Span A (root): start=0ms, duration=50ms
Span B (child of A): start=10ms, duration=20ms
Span C (child of A): start=35ms, duration=10ms
Solution
Step 1: Understand root span duration
The root span duration represents the total time of the entire request, including child spans.
Step 2: Analyze given spans
Span A starts at 0ms and lasts 50ms, so total time is 50ms regardless of child spans.
Final Answer:
50ms -> Option A
Quick Check:
Root span duration = total request time = 50ms [OK]
- Adding child spans durations incorrectly
- Ignoring root span duration
- Confusing start times with total duration
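The span timings from this question can be checked with a few lines of code. This is an illustrative sketch: the `Span` dataclass and the helpers `total_request_ms` and `self_time_ms` are names invented for the example, not part of any tracing client.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    start_ms: int
    duration_ms: int
    parent: Optional[str] = None

    @property
    def end_ms(self) -> int:
        return self.start_ms + self.duration_ms


spans = [
    Span("A", 0, 50),        # root
    Span("B", 10, 20, "A"),  # child of A
    Span("C", 35, 10, "A"),  # child of A
]


def total_request_ms(spans: list) -> int:
    """Total request time is the duration of the root span."""
    root = next(s for s in spans if s.parent is None)
    return root.duration_ms


def self_time_ms(spans: list, name: str) -> int:
    """Time spent in span `name` that is not covered by its direct children."""
    parent = next(s for s in spans if s.name == name)
    children = sorted((s for s in spans if s.parent == name),
                      key=lambda s: s.start_ms)
    covered, cursor = 0, parent.start_ms
    for child in children:
        start = max(child.start_ms, cursor)       # skip time already counted
        end = min(child.end_ms, parent.end_ms)    # clamp to the parent span
        if end > start:
            covered += end - start
            cursor = end
    return parent.duration_ms - covered


print(total_request_ms(spans))             # 50
print(self_time_ms(spans, "A"))            # 20
print(sum(s.duration_ms for s in spans))   # 80, the common wrong answer
```

Summing every span's duration gives 80ms because the children run inside the root span's window and get counted twice; the root span alone already covers the full 50ms.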
Solution
Step 1: Identify cause of missing spans
If spans are missing, it usually means trace context was not passed properly between services.
Step 2: Eliminate unrelated causes
CPU cores, database status, or low network latency do not cause missing trace spans.
Final Answer:
The services are not propagating the trace context headers correctly -> Option D
Quick Check:
Missing spans = trace context not propagated [OK]
- Blaming unrelated system resources
- Ignoring header propagation
- Assuming network latency causes missing spans
Solution
Step 1: Consider scalability needs
Tracing every request fully in a large system causes high overhead and storage issues.
Step 2: Identify best practice for high volume tracing
Sampling reduces load by tracing only some requests, and lightweight headers keep propagation efficient.
Step 3: Eliminate poor options
Disabling propagation loses trace linkage; synchronous calls add latency; central DB can bottleneck.
Final Answer:
Use sampling to trace only a subset of requests and propagate trace context with lightweight headers -> Option B
Quick Check:
Sampling + lightweight headers = scalable tracing [OK]
- Tracing all requests causing overhead
- Ignoring trace context propagation
- Using synchronous calls causing latency
