
Distributed tracing in HLD - Scalability & System Analysis

Scalability Analysis - Distributed tracing
Growth Table: Distributed Tracing at Different Scales
| Users/Requests | Trace Volume | Storage Needs | Processing Load | Visualization Complexity |
|---|---|---|---|---|
| 100 users | Low (a few traces/sec) | Small (MBs/day) | A single collector suffices | Simple UI, few traces |
| 10K users | Moderate (hundreds of traces/sec) | GBs/day | Multiple collectors needed | UI supports filtering, sampling |
| 1M users | High (thousands of traces/sec) | TBs/month | Distributed collectors, storage clusters | Advanced querying, sampling, aggregation |
| 100M users | Very high (hundreds of thousands of traces/sec) | Petabytes/year | Highly distributed, sharded storage, autoscaling | AI-assisted analysis, anomaly detection |
First Bottleneck

The first bottleneck is trace data ingestion and storage. As user requests grow, the volume of trace data increases rapidly, and a single collector or storage node cannot keep up with the write throughput and storage demands. The result is ingestion delays or dropped trace data.
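One common mitigation is to buffer spans in the application and flush them in batches, shedding load explicitly when the buffer fills rather than stalling the service. The sketch below uses hypothetical class and method names to illustrate the idea; it is not a real tracing library's API.

```python
import queue

class TraceCollector:
    """Hedged sketch of a buffered trace collector (hypothetical names).

    Application threads enqueue spans; a background worker would drain
    them in batches toward storage. The bounded queue makes overload
    behavior explicit: when full, new spans are dropped instead of
    blocking the instrumented application."""

    def __init__(self, max_buffer: int = 10_000, batch_size: int = 500):
        self.buffer = queue.Queue(maxsize=max_buffer)
        self.batch_size = batch_size
        self.dropped = 0  # count of spans shed under overload

    def record(self, span) -> None:
        try:
            self.buffer.put_nowait(span)
        except queue.Full:
            self.dropped += 1  # shed load rather than stall the caller

    def drain_batch(self) -> list:
        # Called by a background flusher: pull up to batch_size spans.
        batch = []
        while len(batch) < self.batch_size:
            try:
                batch.append(self.buffer.get_nowait())
            except queue.Empty:
                break
        return batch
```

Tracking the `dropped` counter is itself useful: it is exactly the signal that tells you the ingestion bottleneck has been hit.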

Scaling Solutions
  • Horizontal scaling: Add more trace collectors and storage nodes to distribute load.
  • Sampling: Collect only a subset of traces to reduce volume.
  • Aggregation: Summarize trace data to reduce storage and processing.
  • Sharding: Partition trace data by service or time to scale storage.
  • Compression: Compress trace data before storage to save space.
  • CDN/Edge: Use edge collectors near services to reduce network load.
  • Asynchronous processing: Buffer and batch trace data to smooth spikes.
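The sampling bullet above can be made concrete. A minimal sketch of head-based sampling, assuming trace IDs are strings: hashing the trace ID makes the keep/drop decision deterministic, so every service in the request path agrees on it and sampled traces stay complete.

```python
import hashlib

def should_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Head-based probabilistic sampling keyed on the trace ID.

    A deterministic hash keeps the decision consistent across services:
    all spans of a sampled trace are kept, and unsampled traces incur
    no storage cost at all."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate
```

At `rate=0.01`, storage and processing load drop by roughly 100x; the trade-off is that rare errors may fall outside the sampled set, which is why tail-based sampling (deciding after the trace completes) is sometimes layered on top.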
Back-of-Envelope Cost Analysis

Assuming 1M users generating 1000 traces/sec, each trace ~10KB:

  • Data ingestion: 1000 traces/sec * 10KB = ~10MB/sec (~80Mbps)
  • Storage per day: 10MB/sec * 86400 sec = ~864GB/day
  • Storage per month: ~26TB (864GB/day × 30)
  • Collector servers: if each handles ~5000 traces/sec, a single server covers the 1000 traces/sec load; provision at least two for redundancy and failover.
  • Storage cluster: with multiple spans per trace and replication, plan for roughly 10K writes/sec.
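The arithmetic above is easy to verify in a few lines (using decimal units, 1 MB = 1000 KB):

```python
# Back-of-envelope check for 1M users, 1000 traces/sec, ~10KB per trace.
traces_per_sec = 1_000
trace_size_kb = 10

ingest_mb_per_sec = traces_per_sec * trace_size_kb / 1_000   # 10 MB/s
ingest_mbps = ingest_mb_per_sec * 8                          # ~80 Mbit/s
storage_gb_per_day = ingest_mb_per_sec * 86_400 / 1_000      # 864 GB/day
storage_tb_per_month = storage_gb_per_day * 30 / 1_000       # ~26 TB/month
```

Walking through this kind of calculation aloud is exactly what an interviewer wants to see: each line is one unit conversion, so the result is easy to sanity-check.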
Interview Tip

Start by explaining what distributed tracing is and why it matters. Then discuss how trace volume grows with users and requests. Identify the bottleneck in ingestion and storage, and propose scaling solutions like sampling and horizontal scaling. Mention the trade-offs, such as lost trace detail from sampling versus storage cost. Finally, discuss monitoring and alerting for the health of the trace system itself.

Self Check Question

Your trace database handles 1000 queries per second (QPS). Traffic grows 10x to 10,000 QPS. What do you do first and why?

Answer: Enable or tighten sampling first, since it is the quickest and cheapest lever for cutting trace volume; then add read replicas and shard storage to distribute the remaining query load. This prevents overload and maintains performance without a 10x cost increase.

Key Result
Distributed tracing scales by managing trace data volume through sampling, horizontal scaling of collectors and storage, and data partitioning to handle increasing user traffic without loss or delay.