| Users/Requests | Trace Volume | Storage Needs | Processing Load | Visualization Complexity |
|---|---|---|---|---|
| 100 users | Low (a few traces/sec) | Small (MBs/day) | Single collector suffices | Simple UI, few traces |
| 10K users | Moderate (hundreds of traces/sec) | GBs/day | Multiple collectors needed | UI supports filtering, sampling |
| 1M users | High (thousands of traces/sec) | TBs/month | Distributed collectors, storage clusters | Advanced querying, sampling, aggregation |
| 100M users | Very high (hundreds of thousands of traces/sec) | Petabytes/year | Highly distributed, sharded storage, autoscaling | AI-assisted analysis, anomaly detection |
## Distributed Tracing in HLD: Scalability & System Analysis
The first bottleneck is trace data ingestion and storage. As user requests grow, trace volume increases rapidly, and a single collector or storage node cannot sustain the write throughput or storage demand, which leads to ingestion delays and dropped trace data. Mitigations:
- Horizontal scaling: Add more trace collectors and storage nodes to distribute load.
- Sampling: Collect only a subset of traces to reduce volume.
- Aggregation: Summarize trace data to reduce storage and processing.
- Sharding: Partition trace data by service or time to scale storage.
- Compression: Compress trace data before storage to save space.
- CDN/Edge: Use edge collectors near services to reduce network load.
- Asynchronous processing: Buffer and batch trace data to smooth spikes.
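Sampling is typically implemented head-based: the keep/drop decision is made once when the root span is created, so a trace is either kept whole or dropped whole, and every service in the call path agrees. A minimal sketch (the `ProbabilisticSampler` class and the modulo scheme are illustrative assumptions, not any specific library's API):

```python
class ProbabilisticSampler:
    """Head-based sampler: decides per trace ID, not per span."""

    def __init__(self, rate: float):
        # rate is the fraction of traces to keep, e.g. 0.01 for 1%.
        assert 0.0 <= rate <= 1.0
        self.rate = rate

    def should_sample(self, trace_id: int) -> bool:
        # Derive the decision deterministically from the trace ID so
        # every service makes the same keep/drop choice without
        # coordinating (tail-based sampling would instead decide after
        # the whole trace has been collected).
        return (trace_id % 10_000) < self.rate * 10_000

sampler = ProbabilisticSampler(rate=0.01)  # keep ~1% of traces
kept = sum(sampler.should_sample(tid) for tid in range(100_000))
# kept == 1000: exactly 1% of these sequential IDs pass the filter
```

The trade-off named above applies directly: a 1% rate cuts storage 100x but loses 99% of traces, so rare errors may go unrecorded unless error traces are sampled at a higher rate.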
Assuming 1M users generating 1000 traces/sec, with each trace averaging ~10KB:
- Data ingestion: 1000 traces/sec * 10KB = ~10MB/sec (~80Mbps)
- Storage per day: 10MB/sec * 86,400 sec = ~864GB/day
- Storage per month: ~26TB
- Collector servers: at ~5000 traces/sec per collector, one server covers the 1000 traces/sec load, but deploy at least two for redundancy and headroom.
- Storage cluster: each trace contains multiple spans; at ~10 spans per trace, the cluster must sustain ~10K span writes/sec, plus replication overhead.
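The estimate can be reproduced in a few lines; the per-trace size, spans-per-trace count, and per-collector capacity are the same assumed figures stated in the bullets above:

```python
import math

# Assumed workload figures from the estimate above.
TRACES_PER_SEC = 1_000
TRACE_SIZE_KB = 10
SPANS_PER_TRACE = 10              # assumption: average spans per trace
COLLECTOR_TRACES_PER_SEC = 5_000  # assumption: one collector's capacity

ingest_mb_per_sec = TRACES_PER_SEC * TRACE_SIZE_KB / 1_000         # 10 MB/s
ingest_mbps = ingest_mb_per_sec * 8                                # 80 Mbps
storage_gb_per_day = ingest_mb_per_sec * 86_400 / 1_000            # 864 GB/day
storage_tb_per_month = storage_gb_per_day * 30 / 1_000             # ~26 TB/month
span_writes_per_sec = TRACES_PER_SEC * SPANS_PER_TRACE             # 10,000/s
collectors = math.ceil(TRACES_PER_SEC / COLLECTOR_TRACES_PER_SEC)  # 1
```

Rerunning the same arithmetic with 10x the traffic (the 100M-user row) shows every quantity scaling linearly, which is why sharding and sampling become mandatory at that tier.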
How to structure the answer:
- Explain what distributed tracing is and why it matters.
- Show how trace volume grows with users and requests.
- Identify the bottleneck: ingestion and storage.
- Propose scaling solutions such as sampling and horizontal scaling.
- Call out trade-offs, e.g. data loss from sampling vs. storage cost.
- Close with monitoring and alerting for the health of the trace system itself.
Your trace database handles 1000 queries per second (QPS). Traffic grows 10x to 10,000 QPS. What do you do first and why?
Answer: Add read replicas first, since query load is read-heavy and replicas scale reads without rearchitecting. Then shard storage by service or time to distribute load further, and apply sampling to cap the underlying data volume. Together these prevent overload and maintain query latency.
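The sizing step behind that answer can be sketched as a capacity calculation. A hypothetical helper (the name `nodes_needed` and the 70% utilization target are illustrative assumptions) computes how many replicas or shards the 10x jump requires:

```python
import math

def nodes_needed(total_qps: int, per_node_qps: int, headroom: float = 0.7) -> int:
    # Size for a target utilization (headroom) rather than 100% capacity,
    # so a node failure or traffic spike does not overload the remainder.
    return math.ceil(total_qps / (per_node_qps * headroom))

# 10x growth: 10,000 QPS against nodes that each sustain 1,000 QPS,
# sized for 70% utilization -> 15 nodes (replicas and/or shards).
required = nodes_needed(10_000, 1_000)
```

Sampling then attacks the same problem from the other side by shrinking `total_qps` at the source instead of adding nodes.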