Distributed tracing (Jaeger, Zipkin) in Microservices - Scalability & System Analysis

| Users / Requests | Trace Volume | Storage Needs | Processing Load | Visualization Complexity |
|---|---|---|---|---|
| 100 users | Low (a few traces/sec) | Minimal, local storage | Single Jaeger/Zipkin instance | Simple trace views |
| 10,000 users | Moderate (hundreds of traces/sec) | Increased storage, possibly a remote DB | Multiple collectors, basic load balancing | More complex trace aggregation |
| 1,000,000 users | High (thousands of traces/sec) | Distributed storage (Cassandra, Elasticsearch) | Horizontally scaled collectors and query services | Advanced UI filtering and sampling needed |
| 100,000,000 users | Very high (tens of thousands of traces/sec) | Sharded, multi-region storage clusters | Highly scalable, multi-tenant tracing infrastructure | Automated anomaly detection, AI-assisted analysis |
The first bottleneck is the storage backend for trace data. As trace volume grows, the database that stores spans and traces becomes overwhelmed by write and read requests. This causes delays in trace ingestion and slow query responses.
- Horizontal scaling: Add more collector and query service instances behind load balancers to handle increased traffic.
- Storage optimization: Use scalable distributed databases like Cassandra or Elasticsearch with sharding and replication.
- Sampling: Reduce data volume by sampling traces (e.g., trace only 10% of requests; head-based sampling decides at the root span, tail-based sampling decides after the trace completes).
- Caching: Cache frequent query results to reduce load on storage.
- Data retention policies: Archive or delete old traces to save storage space.
- Multi-region deployment: Deploy tracing infrastructure closer to services to reduce latency and bandwidth.
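To make the sampling bullet above concrete, here is a minimal head-based sampler sketch in Python. The `should_sample` function and the 10% rate are illustrative assumptions, not Jaeger's or Zipkin's actual API; hashing the trace ID makes the keep/drop decision consistent across every service that sees the same trace.

```python
import hashlib

def should_sample(trace_id: str, rate_percent: int = 10) -> bool:
    """Head-based probabilistic sampling: hash the trace ID so every
    service makes the same keep/drop decision for a given trace."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rate_percent

# Deterministic per trace ID, so a trace is never partially sampled
# across services; roughly 10% of traces survive.
kept = sum(should_sample(f"trace-{i}") for i in range(10_000))
print(f"kept {kept} of 10000 traces")
```

Because the decision is a pure function of the trace ID, no coordination between services is needed, which is why head-based sampling is usually the first lever to pull.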
Assuming 1 million users generating 10,000 traces per second, each trace averaging 10 spans of 1KB each:
- Trace data per second: 10,000 traces * 10 spans * 1KB = 100MB/s
- Storage per day: 100MB/s * 3600 * 24 ≈ 8.6TB/day
- Network bandwidth: 100MB/s ≈ 800Mbps, so >1Gbps links are needed to handle ingestion
- Database QPS: Storage must handle ~100,000 writes/sec (spans)
- Collector servers: Multiple instances needed to handle ingestion load
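The arithmetic above can be checked with a few lines of Python (decimal units are assumed: 1 KB = 1000 bytes, 1 TB = 10^6 MB):

```python
# Back-of-envelope capacity check for the tracing pipeline.
TRACES_PER_SEC = 10_000
SPANS_PER_TRACE = 10
SPAN_SIZE_KB = 1

ingest_mb_per_sec = TRACES_PER_SEC * SPANS_PER_TRACE * SPAN_SIZE_KB / 1000
storage_tb_per_day = ingest_mb_per_sec * 86_400 / 1_000_000
span_writes_per_sec = TRACES_PER_SEC * SPANS_PER_TRACE
bandwidth_mbps = ingest_mb_per_sec * 8

print(f"{ingest_mb_per_sec:.0f} MB/s ingest")        # 100 MB/s
print(f"{storage_tb_per_day:.2f} TB/day storage")    # 8.64 TB/day
print(f"{span_writes_per_sec} span writes/s")        # 100000
print(f"{bandwidth_mbps:.0f} Mbps (>1 Gbps links)")  # 800
```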
- Start by explaining what distributed tracing solves in microservices.
- Then discuss how trace data volume grows with users and requests.
- Identify the storage backend as the first bottleneck.
- Propose sampling and horizontal scaling of collectors and storage.
- Mention trade-offs like data retention and query latency.
- Finish with how to monitor and optimize the tracing system itself.
Your tracing database handles 1000 writes per second. Traffic grows 10x to 10,000 writes per second. What do you do first and why?
Answer: Implement sampling first, because it immediately reduces the number of traces written without provisioning new infrastructure; then horizontally scale the storage backend with sharding or replicas to absorb the remaining write load. Together these keep the database from becoming a bottleneck.
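One way to make that answer concrete is a tiny helper showing how sampling and sharding trade off against per-node write capacity. This is a sketch; `shards_needed` is a hypothetical name, and 1,000 writes/s per node comes from the question's premise.

```python
import math

def shards_needed(writes_per_sec: int, sample_percent: int, node_capacity: int) -> int:
    """Storage shards required to keep per-node write load within capacity."""
    kept = writes_per_sec * sample_percent // 100  # writes/s surviving sampling
    return math.ceil(kept / node_capacity)

print(shards_needed(10_000, 100, 1_000))  # 10 -> no sampling means 10x the storage nodes
print(shards_needed(10_000, 10, 1_000))   # 1  -> 10% sampling alone fits the original node
print(shards_needed(10_000, 50, 1_000))   # 5  -> middle ground: keep half, shard 5 ways
```

The sampling rate is the cheap knob (it costs trace completeness, not hardware), which is why it comes first; sharding then covers whatever load you choose to keep.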