HLD · System Design · ~25 mins

Distributed tracing in HLD - System Design Exercise

Design: Distributed Tracing System
This design covers trace data collection, storage, querying, and visualization. It does not cover instrumentation libraries in detail or alerting systems.
Functional Requirements
FR1: Collect trace data from multiple microservices in a distributed system
FR2: Track requests as they flow through different services
FR3: Visualize the end-to-end request path with timing information
FR4: Support high throughput with minimal impact on service latency
FR5: Allow querying traces by trace ID, service name, or time range
FR6: Provide sampling to control data volume
FR7: Integrate with existing logging and monitoring tools
Non-Functional Requirements
NFR1: Handle up to 100,000 traces per second
NFR2: API response latency for trace queries under 500ms (p99)
NFR3: System availability of 99.9%
NFR4: Data retention for 7 days
NFR5: Minimal overhead on instrumented services (<5ms added latency per request)
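A quick back-of-envelope estimate helps size storage against NFR1 and NFR4. The sketch below assumes roughly 10 spans per trace and ~500 bytes per encoded span; neither figure comes from the requirements, and sampling (FR6) would shrink the result proportionally.

```python
# Rough capacity estimate for the NFRs above.
TRACES_PER_SEC = 100_000   # NFR1
SPANS_PER_TRACE = 10       # assumption, not in the requirements
SPAN_BYTES = 500           # assumption: avg encoded span size
RETENTION_DAYS = 7         # NFR4

spans_per_sec = TRACES_PER_SEC * SPANS_PER_TRACE
ingest_mb_per_sec = spans_per_sec * SPAN_BYTES / 1e6
retained_tb = ingest_mb_per_sec * 86_400 * RETENTION_DAYS / 1e6

print(f"{spans_per_sec:,} spans/s, ~{ingest_mb_per_sec:.0f} MB/s ingest")
print(f"~{retained_tb:.0f} TB retained over {RETENTION_DAYS} days")
```

At these assumptions the system ingests about 500 MB/s and retains roughly 300 TB over 7 days before sampling, which motivates both the sampling module and a horizontally scalable store.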
Think Before You Design
Key Components
Instrumentation libraries for services
Trace context propagation mechanism
Trace collector or agent
Trace storage database
Indexing and query engine
Visualization UI/dashboard
Sampling and rate limiting module
Design Patterns
Context propagation pattern
Event logging and span creation
Sampling patterns (head-based, tail-based)
Data aggregation and indexing
Asynchronous data ingestion
Caching for query performance
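The context propagation pattern above is commonly implemented with the W3C `traceparent` header (`version-trace_id-parent_id-flags`). The sketch below is illustrative only; real services would use an OpenTelemetry propagator rather than hand-rolling this.

```python
# Minimal sketch of W3C traceparent-style context propagation.
import secrets

def start_trace() -> str:
    """Mint a new traceparent header at the edge of the system."""
    trace_id = secrets.token_hex(16)  # 128-bit trace ID
    span_id = secrets.token_hex(8)    # 64-bit span ID
    return f"00-{trace_id}-{span_id}-01"

def propagate(incoming: str) -> str:
    """Keep the trace ID; mint a new span ID for the downstream call."""
    version, trace_id, _parent, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

header = start_trace()
child = propagate(header)
assert child.split("-")[1] == header.split("-")[1]  # same trace ID
```

The key property is that the trace ID is preserved across every hop while each hop contributes a fresh span ID, which is what lets the backend stitch spans back into one trace.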
Reference Architecture
Client --> Service A --> Service B --> Service C
  |          |           |           |
  |          |           |           +--> Trace Collector --> Trace Storage
  |          |           +--> Trace Collector
  |          +--> Trace Collector
  +--> Trace Collector

Trace Storage <--> Query Engine <--> Visualization UI
Components
Instrumentation Libraries
OpenTelemetry SDKs
Automatically capture trace spans and propagate trace context in each service
Trace Collector
gRPC/HTTP Collector Service
Receive trace data from services, batch and forward to storage
Trace Storage
Distributed NoSQL DB (e.g., Cassandra, Elasticsearch)
Store trace spans and indexes for efficient retrieval
Query Engine
Search and indexing system (e.g., Elasticsearch)
Support fast queries by trace ID, service, or time range
Visualization UI
Web Dashboard (React or similar)
Display trace timelines and service call graphs for developers
Sampling Module
In-process or Collector-based
Control volume of trace data collected to reduce overhead
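The collector's core job, batching spans before writing to storage, can be sketched as below. `store` is a stand-in for a Cassandra or Elasticsearch bulk write; a production collector would also flush on a timer and bound memory.

```python
# Sketch of a batching Trace Collector: spans are buffered and flushed
# to storage in batches, decoupling ingest from storage writes.
from typing import Callable

class BatchingCollector:
    def __init__(self, store: Callable[[list], None], batch_size: int = 100):
        self.store = store
        self.batch_size = batch_size
        self.buffer: list = []

    def receive(self, span: dict) -> None:
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.store(self.buffer)
            self.buffer = []

batches = []
collector = BatchingCollector(batches.append, batch_size=2)
collector.receive({"span_id": "a"})
collector.receive({"span_id": "b"})  # second span triggers a flush
```

Batching amortizes per-write overhead on the storage tier, which matters at the span rates estimated above.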
Request Flow
1. Client sends a request to Service A with trace context headers.
2. Service A's instrumentation creates a span and propagates the trace context to Service B.
3. Service B creates its own span linked to the trace and calls Service C the same way.
4. Each service sends its collected spans asynchronously to the Trace Collector.
5. The Trace Collector batches spans and stores them in Trace Storage.
6. The Query Engine indexes trace data for fast retrieval.
7. A developer queries traces via the Visualization UI using a trace ID or filters.
8. The UI displays the full trace timeline and service call graph.
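The flow above, condensed: each hop creates a span that records its parent, so the collected spans reconstruct the call tree. The names here are illustrative, not a real instrumentation API.

```python
# Three services in a call chain, each emitting one span.
import time, uuid

spans = []  # stands in for spans shipped to the Trace Collector

def handle(service, trace_id, parent_id, call_next=None):
    span_id = uuid.uuid4().hex[:16]
    start = time.time()
    if call_next:
        call_next(trace_id, span_id)  # propagate trace_id + new parent
    spans.append({"trace_id": trace_id, "span_id": span_id,
                  "parent_id": parent_id, "service": service,
                  "start": start, "end": time.time()})

trace_id = uuid.uuid4().hex
handle("A", trace_id, None,
       lambda t, p: handle("B", t, p,
       lambda t, p: handle("C", t, p)))
# spans now holds one span per service, linked by parent_id
```

Note that spans are emitted asynchronously and may arrive at the collector out of order (here C's span is recorded first); the tree is recovered from `trace_id` and `parent_id`, not arrival order.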
Database Schema
Entities:
- Trace: unique trace ID, start time, end time
- Span: span ID, trace ID (foreign key), parent span ID, service name, operation name, start time, end time, tags/metadata

Relationships:
- One Trace has many Spans (1:N)
- Spans are linked by parent span ID, forming a tree that represents the call hierarchy
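The schema can be modeled as below, together with the parent-span-ID grouping a query engine would perform when rendering a trace. Field names follow the entities above; the code itself is a sketch, not a storage-layer API.

```python
# Span entity plus call-tree reconstruction from parent_span_id.
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    span_id: str
    trace_id: str
    parent_span_id: Optional[str]
    service_name: str
    operation_name: str
    start_time: float
    end_time: float
    tags: dict = field(default_factory=dict)

def build_tree(spans):
    """Group spans by parent_span_id to recover the call hierarchy."""
    children = defaultdict(list)
    for s in spans:
        children[s.parent_span_id].append(s)
    return children  # children[None] holds the root span(s)

root = Span("a1", "t1", None, "api", "GET /orders", 0.0, 1.0)
leaf = Span("b2", "t1", "a1", "db", "SELECT", 0.1, 0.4)
tree = build_tree([root, leaf])
```

In a wide-column store like Cassandra, `trace_id` would typically be the partition key so all spans of a trace land on one partition and can be read in a single query.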
Scaling Discussion
Bottlenecks
High volume of trace data causing storage and query slowdowns
Trace Collector becoming a bottleneck under heavy load
Network overhead from trace context propagation
Query latency increasing with data size
Sampling misconfiguration leading to too much or too little data
Solutions
Use scalable distributed storage like Cassandra or Elasticsearch clusters
Deploy multiple Trace Collector instances with load balancing
Optimize instrumentation to propagate minimal trace context data
Implement indexing strategies and caching in Query Engine
Apply adaptive sampling strategies to balance data volume and coverage
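One common head-based sampling approach hashes the trace ID into a bucket, so every service independently reaches the same keep/drop decision without coordination. The sketch below assumes hex trace IDs and an example 1% rate; adaptive sampling would adjust `rate` at runtime from observed traffic.

```python
# Head-based probabilistic sampling, deterministic on the trace ID.
import hashlib

def should_sample(trace_id: str, rate: float = 0.01) -> bool:
    # Hash the trace ID into [0, 1) and compare against the rate.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

kept = sum(should_sample(f"{i:032x}", rate=0.1) for i in range(10_000))
# roughly 10% of traces are kept, regardless of which service decides
```

The trade-off versus tail-based sampling: this decision is made before the trace completes, so it cannot preferentially keep slow or failed traces, but it requires no buffering in the collector.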
Interview Tips
Time: Spend 10 minutes clarifying requirements and constraints, 20 minutes designing components and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Explain trace context propagation clearly with examples
Discuss trade-offs of sampling strategies
Justify technology choices for storage and query
Highlight how to minimize performance impact on services
Describe how visualization helps developers debug distributed systems