
Distributed tracing (Jaeger, Zipkin) in Microservices - Scalability & System Analysis

Scalability Analysis - Distributed tracing (Jaeger, Zipkin)
Growth Table: Distributed Tracing at Different Scales
| Users / Requests | Trace Volume | Storage Needs | Processing Load | Visualization Complexity |
|---|---|---|---|---|
| 100 users | Low (a few traces/sec) | Minimal, local storage | Single Jaeger/Zipkin instance | Simple trace views |
| 10,000 users | Moderate (hundreds of traces/sec) | Increased storage, possibly a remote DB | Multiple collectors, basic load balancing | More complex trace aggregation |
| 1,000,000 users | High (thousands of traces/sec) | Distributed storage (Cassandra, Elasticsearch) | Horizontal scaling of collectors and query services | Advanced UI filtering and sampling needed |
| 100,000,000 users | Very high (tens of thousands of traces/sec) | Sharded, multi-region storage clusters | Highly scalable, multi-tenant tracing infrastructure | Automated anomaly detection, AI-assisted analysis |
First Bottleneck

The first bottleneck is the storage backend for trace data. As trace volume grows, the database that stores spans and traces becomes overwhelmed by write and read requests. This causes delays in trace ingestion and slow query responses.
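When the backend falls behind, collectors commonly protect the services being traced by buffering spans in a bounded queue and shedding load rather than blocking. A minimal sketch of that idea, assuming an in-memory queue with illustrative names (not Jaeger's or Zipkin's actual internals):

```python
from collections import deque

# Minimal load-shedding sketch: a collector-side queue that drops spans
# once the storage backend falls behind, instead of blocking producers.
class BoundedSpanQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def offer(self, span):
        """Accept a span if there is room; otherwise drop it."""
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # backend is the bottleneck: shed load
            return False
        self.queue.append(span)
        return True

    def drain(self, batch_size):
        """Hand a batch of queued spans to the storage writer."""
        batch = []
        while self.queue and len(batch) < batch_size:
            batch.append(self.queue.popleft())
        return batch

q = BoundedSpanQueue(capacity=3)
for i in range(5):
    q.offer({"span_id": i})
print(len(q.queue), q.dropped)  # → 3 2 (three queued, two dropped)
```

Dropped-span counters like `dropped` are exactly the metric to watch: a rising drop rate is the earliest signal that the storage backend needs scaling.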

Scaling Solutions
  • Horizontal scaling: Add more collector and query service instances behind load balancers to handle increased traffic.
  • Storage optimization: Use scalable distributed databases like Cassandra or Elasticsearch with sharding and replication.
  • Sampling: Reduce data volume by sampling traces (e.g., only 10% of requests traced).
  • Caching: Cache frequent query results to reduce load on storage.
  • Data retention policies: Archive or delete old traces to save storage space.
  • Multi-region deployment: Deploy tracing infrastructure closer to services to reduce latency and bandwidth.
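The sampling bullet above is the highest-leverage lever. A common approach is head-based, trace-ID-keyed sampling: hash the trace ID so every service makes the same keep/drop decision for a given trace. A hedged sketch (illustrative function names, not a real Jaeger/Zipkin API):

```python
import hashlib

# Head-based probabilistic sampler: keep ~10% of traces, decided
# deterministically from the trace ID so all services agree.
def should_sample(trace_id: str, rate: float = 0.10) -> bool:
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (digest % 10_000) < rate * 10_000

sampled = sum(should_sample(f"trace-{i}") for i in range(100_000))
print(f"sampled {sampled} of 100000 traces (~{sampled / 1000:.1f}%)")
```

Because the decision is a pure function of the trace ID, either all spans of a trace are stored or none are, which keeps traces complete even at a 10% rate.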
Back-of-Envelope Cost Analysis

Assuming 1 million users generating 10,000 traces per second, each trace averaging 10 spans of 1KB each:

  • Trace data per second: 10,000 traces * 10 spans * 1KB = 100MB/s
  • Storage per day: 100MB/s * 3600 * 24 ≈ 8.6TB/day
  • Network bandwidth: 100MB/s ≈ 0.8Gbps sustained, so plan for at least 1Gbps links with headroom
  • Database QPS: Storage must handle ~100,000 writes/sec (spans)
  • Collector servers: Multiple instances needed to handle ingestion load
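The estimates above can be reproduced in a few lines (decimal units, 1MB = 1000KB, matching the figures in the list):

```python
# Back-of-envelope tracing load, reproducing the estimates above.
traces_per_sec = 10_000
spans_per_trace = 10
span_kb = 1

writes_per_sec = traces_per_sec * spans_per_trace      # span writes: 100,000/s
mb_per_sec = writes_per_sec * span_kb / 1000           # ingest: 100 MB/s
tb_per_day = mb_per_sec * 3600 * 24 / 1_000_000        # storage: 8.64 TB/day
gbps = mb_per_sec * 8 / 1000                           # bandwidth: 0.8 Gbps

print(writes_per_sec, mb_per_sec, round(tb_per_day, 2), gbps)
# → 100000 100.0 8.64 0.8
```

Being able to derive these numbers live is worth more in an interview than memorizing them; changing one input (say, the sampling rate) immediately shows its effect on storage and QPS.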
Interview Tip

Start by explaining what distributed tracing solves in microservices. Then discuss how trace data volume grows with users and requests. Identify the storage backend as the first bottleneck. Propose sampling and horizontal scaling of collectors and storage. Mention trade-offs like data retention and query latency. Finish with how to monitor and optimize the tracing system itself.

Self Check Question

Your tracing database handles 1000 writes per second. Traffic grows 10x to 10,000 writes per second. What do you do first and why?

Answer: Implement sampling to reduce the number of traces stored, and horizontally scale the storage backend with sharding or replicas to handle increased write load. This prevents the database from becoming a bottleneck.
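The sampling rate for the stopgap falls straight out of the numbers (illustrative values matching the question):

```python
# What sampling rate keeps the backend within capacity after 10x growth?
capacity_wps = 1_000    # writes/sec the database sustains today
incoming_wps = 10_000   # writes/sec after the 10x traffic jump

sample_rate = capacity_wps / incoming_wps
print(f"sample at {sample_rate:.0%} until storage is scaled out")
# → sample at 10% until storage is scaled out
```

Sampling buys time immediately (a config change), while sharding or adding replicas is the slower structural fix that restores full-fidelity tracing.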

Key Result
Distributed tracing scales well initially, but the storage backend becomes the first bottleneck as trace volume grows. Sampling, plus horizontal scaling of collectors and storage, is key to handling trace volumes far beyond what a single instance can sustain.