| Users/Requests | Trace Volume | Storage Needs | Processing Load | Visualization Complexity |
|---|---|---|---|---|
| 100 users | Low (a few traces/sec) | Small (MBs/day) | Single collector suffices | Simple UI, few traces |
| 10K users | Moderate (hundreds of traces/sec) | GBs/day | Multiple collectors needed | UI supports filtering, sampling |
| 1M users | High (thousands of traces/sec) | TBs/month | Distributed collectors, storage clusters | Advanced querying, sampling, aggregation |
| 100M users | Very high (hundreds of thousands of traces/sec) | Petabytes/year | Highly distributed, sharded storage, autoscaling | AI-assisted analysis, anomaly detection |
## Distributed Tracing in HLD: Scalability & System Analysis
The first bottleneck is trace data ingestion and storage. As user requests grow, trace volume increases rapidly, and a single collector or storage node cannot sustain the write throughput or storage demand, which leads to ingestion delays and dropped trace data. Mitigations:
- Horizontal scaling: Add more trace collectors and storage nodes to distribute load.
- Sampling: Collect only a subset of traces to reduce volume.
- Aggregation: Summarize trace data to reduce storage and processing.
- Sharding: Partition trace data by service or time to scale storage.
- Compression: Compress trace data before storage to save space.
- CDN/Edge: Use edge collectors near services to reduce network load.
- Asynchronous processing: Buffer and batch trace data to smooth spikes.
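Sampling is typically implemented head-based: the keep/drop decision is made once when the root span is created, so a trace is either kept whole or dropped whole, and every service in the call path agrees. A minimal sketch (the `ProbabilisticSampler` class and the modulo scheme are illustrative assumptions, not any specific library's API):

```python
class ProbabilisticSampler:
    """Head-based sampler: decides per trace ID, not per span."""

    def __init__(self, rate: float):
        # rate is the fraction of traces to keep, e.g. 0.01 for 1%.
        assert 0.0 <= rate <= 1.0
        self.rate = rate

    def should_sample(self, trace_id: int) -> bool:
        # Derive the decision deterministically from the trace ID so
        # every service makes the same keep/drop choice without
        # coordinating (tail-based sampling would instead decide after
        # the whole trace has been collected).
        return (trace_id % 10_000) < self.rate * 10_000

sampler = ProbabilisticSampler(rate=0.01)  # keep ~1% of traces
kept = sum(sampler.should_sample(tid) for tid in range(100_000))
# kept == 1000: exactly 1% of these sequential IDs pass the filter
```

The trade-off named above applies directly: a 1% rate cuts storage 100x but loses 99% of traces, so rare errors may go unrecorded unless error traces are sampled at a higher rate.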
Assuming 1M users generating 1000 traces/sec, with each trace averaging ~10KB:
- Data ingestion: 1000 traces/sec * 10KB = ~10MB/sec (~80Mbps)
- Storage per day: 10MB/sec * 86,400 sec = ~864GB/day
- Storage per month: ~26TB
- Collector servers: at ~5000 traces/sec per collector, one server covers the 1000 traces/sec load, but deploy at least two for redundancy and headroom.
- Storage cluster: each trace contains multiple spans; at ~10 spans per trace, the cluster must sustain ~10K span writes/sec, plus replication overhead.
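The estimate can be reproduced in a few lines; the per-trace size, spans-per-trace count, and per-collector capacity are the same assumed figures stated in the bullets above:

```python
import math

# Assumed workload figures from the estimate above.
TRACES_PER_SEC = 1_000
TRACE_SIZE_KB = 10
SPANS_PER_TRACE = 10              # assumption: average spans per trace
COLLECTOR_TRACES_PER_SEC = 5_000  # assumption: one collector's capacity

ingest_mb_per_sec = TRACES_PER_SEC * TRACE_SIZE_KB / 1_000         # 10 MB/s
ingest_mbps = ingest_mb_per_sec * 8                                # 80 Mbps
storage_gb_per_day = ingest_mb_per_sec * 86_400 / 1_000            # 864 GB/day
storage_tb_per_month = storage_gb_per_day * 30 / 1_000             # ~26 TB/month
span_writes_per_sec = TRACES_PER_SEC * SPANS_PER_TRACE             # 10,000/s
collectors = math.ceil(TRACES_PER_SEC / COLLECTOR_TRACES_PER_SEC)  # 1
```

Rerunning the same arithmetic with 10x the traffic (the 100M-user row) shows every quantity scaling linearly, which is why sharding and sampling become mandatory at that tier.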
How to structure the answer:
- Explain what distributed tracing is and why it matters.
- Show how trace volume grows with users and requests.
- Identify the bottleneck: ingestion and storage.
- Propose scaling solutions such as sampling and horizontal scaling.
- Call out trade-offs, e.g. data loss from sampling vs. storage cost.
- Close with monitoring and alerting for the health of the trace system itself.
Your trace database handles 1000 queries per second (QPS). Traffic grows 10x to 10,000 QPS. What do you do first and why?
Answer: Add read replicas first, since query load is read-heavy and replicas scale reads without rearchitecting. Then shard storage by service or time to distribute load further, and apply sampling to cap the underlying data volume. Together these prevent overload and maintain query latency.
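The sizing step behind that answer can be sketched as a capacity calculation. A hypothetical helper (the name `nodes_needed` and the 70% utilization target are illustrative assumptions) computes how many replicas or shards the 10x jump requires:

```python
import math

def nodes_needed(total_qps: int, per_node_qps: int, headroom: float = 0.7) -> int:
    # Size for a target utilization (headroom) rather than 100% capacity,
    # so a node failure or traffic spike does not overload the remainder.
    return math.ceil(total_qps / (per_node_qps * headroom))

# 10x growth: 10,000 QPS against nodes that each sustain 1,000 QPS,
# sized for 70% utilization -> 15 nodes (replicas and/or shards).
required = nodes_needed(10_000, 1_000)
```

Sampling then attacks the same problem from the other side by shrinking `total_qps` at the source instead of adding nodes.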