Distributed tracing (Jaeger, Zipkin) in Microservices - Scalability & System Analysis

| Users / Requests | Trace Volume | Storage Needs | Processing Load | Visualization Complexity |
|---|---|---|---|---|
| 100 users | Low (a few traces/sec) | Minimal, local storage | Single Jaeger/Zipkin instance | Simple trace views |
| 10,000 users | Moderate (hundreds of traces/sec) | Increased storage, possibly a remote DB | Multiple collectors, basic load balancing | More complex trace aggregation |
| 1,000,000 users | High (thousands of traces/sec) | Distributed storage (Cassandra, Elasticsearch) | Horizontally scaled collectors and query services | Advanced UI filtering and sampling needed |
| 100,000,000 users | Very high (tens of thousands of traces/sec) | Sharded, multi-region storage clusters | Highly scalable, multi-tenant tracing infrastructure | Automated anomaly detection, AI-assisted analysis |
The first bottleneck is the storage backend for trace data. As trace volume grows, the database that stores spans and traces becomes overwhelmed by write and read requests. This causes delays in trace ingestion and slow query responses.
- Horizontal scaling: Add more collector and query service instances behind load balancers to handle increased traffic.
- Storage optimization: Use scalable distributed databases like Cassandra or Elasticsearch with sharding and replication.
- Sampling: Reduce data volume by sampling traces (e.g., trace only 10% of requests; head-based sampling decides at the root span, tail-based sampling decides after the trace completes).
- Caching: Cache frequent query results to reduce load on storage.
- Data retention policies: Archive or delete old traces to save storage space.
- Multi-region deployment: Deploy tracing infrastructure closer to services to reduce latency and bandwidth.
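To make the sampling bullet above concrete, here is a minimal head-based sampler sketch in Python. The `should_sample` function and the 10% rate are illustrative assumptions, not Jaeger's or Zipkin's actual API; hashing the trace ID makes the keep/drop decision consistent across every service that sees the same trace.

```python
import hashlib

def should_sample(trace_id: str, rate_percent: int = 10) -> bool:
    """Head-based probabilistic sampling: hash the trace ID so every
    service makes the same keep/drop decision for a given trace."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rate_percent

# Deterministic per trace ID, so a trace is never partially sampled
# across services; roughly 10% of traces survive.
kept = sum(should_sample(f"trace-{i}") for i in range(10_000))
print(f"kept {kept} of 10000 traces")
```

Because the decision is a pure function of the trace ID, no coordination between services is needed, which is why head-based sampling is usually the first lever to pull.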
Assuming 1 million users generating 10,000 traces per second, each trace averaging 10 spans of 1KB each:
- Trace data per second: 10,000 traces * 10 spans * 1KB = 100MB/s
- Storage per day: 100MB/s * 3600 * 24 ≈ 8.6TB/day
- Network bandwidth: 100MB/s ≈ 800Mbps, so >1Gbps links are needed to handle ingestion
- Database QPS: Storage must handle ~100,000 writes/sec (spans)
- Collector servers: Multiple instances needed to handle ingestion load
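The arithmetic above can be checked with a few lines of Python (decimal units are assumed: 1 KB = 1000 bytes, 1 TB = 10^6 MB):

```python
# Back-of-envelope capacity check for the tracing pipeline.
TRACES_PER_SEC = 10_000
SPANS_PER_TRACE = 10
SPAN_SIZE_KB = 1

ingest_mb_per_sec = TRACES_PER_SEC * SPANS_PER_TRACE * SPAN_SIZE_KB / 1000
storage_tb_per_day = ingest_mb_per_sec * 86_400 / 1_000_000
span_writes_per_sec = TRACES_PER_SEC * SPANS_PER_TRACE
bandwidth_mbps = ingest_mb_per_sec * 8

print(f"{ingest_mb_per_sec:.0f} MB/s ingest")        # 100 MB/s
print(f"{storage_tb_per_day:.2f} TB/day storage")    # 8.64 TB/day
print(f"{span_writes_per_sec} span writes/s")        # 100000
print(f"{bandwidth_mbps:.0f} Mbps (>1 Gbps links)")  # 800
```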
- Start by explaining what distributed tracing solves in microservices.
- Then discuss how trace data volume grows with users and requests.
- Identify the storage backend as the first bottleneck.
- Propose sampling and horizontal scaling of collectors and storage.
- Mention trade-offs like data retention and query latency.
- Finish with how to monitor and optimize the tracing system itself.
Your tracing database handles 1000 writes per second. Traffic grows 10x to 10,000 writes per second. What do you do first and why?
Answer: Implement sampling first, because it immediately reduces the number of traces written without provisioning new infrastructure; then horizontally scale the storage backend with sharding or replicas to absorb the remaining write load. Together these keep the database from becoming a bottleneck.
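One way to make that answer concrete is a tiny helper showing how sampling and sharding trade off against per-node write capacity. This is a sketch; `shards_needed` is a hypothetical name, and 1,000 writes/s per node comes from the question's premise.

```python
import math

def shards_needed(writes_per_sec: int, sample_percent: int, node_capacity: int) -> int:
    """Storage shards required to keep per-node write load within capacity."""
    kept = writes_per_sec * sample_percent // 100  # writes/s surviving sampling
    return math.ceil(kept / node_capacity)

print(shards_needed(10_000, 100, 1_000))  # 10 -> no sampling means 10x the storage nodes
print(shards_needed(10_000, 10, 1_000))   # 1  -> 10% sampling alone fits the original node
print(shards_needed(10_000, 50, 1_000))   # 5  -> middle ground: keep half, shard 5 ways
```

The sampling rate is the cheap knob (it costs trace completeness, not hardware), which is why it comes first; sharding then covers whatever load you choose to keep.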