HLD · System Design · ~25 mins

Distributed tracing in HLD - System Design Exercise

Design: Distributed Tracing System
This design covers trace data collection, storage, querying, and visualization. It does not cover instrumentation libraries in detail or alerting systems.
Functional Requirements
FR1: Collect trace data from multiple microservices in a distributed system
FR2: Track requests as they flow through different services
FR3: Visualize the end-to-end request path with timing information
FR4: Support high throughput with minimal impact on service latency
FR5: Allow querying traces by trace ID, service name, or time range
FR6: Provide sampling to control data volume
FR7: Integrate with existing logging and monitoring tools
Non-Functional Requirements
NFR1: Handle up to 100,000 traces per second
NFR2: API response latency for trace queries under 500ms (p99)
NFR3: System availability of 99.9%
NFR4: Data retention for 7 days
NFR5: Minimal overhead on instrumented services (<5ms added latency per request)
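A quick back-of-envelope estimate helps size storage against NFR1 and NFR4. The sketch below assumes roughly 10 spans per trace and ~500 bytes per encoded span; neither figure comes from the requirements, and sampling (FR6) would shrink the result proportionally.

```python
# Rough capacity estimate for the NFRs above.
TRACES_PER_SEC = 100_000   # NFR1
SPANS_PER_TRACE = 10       # assumption, not in the requirements
SPAN_BYTES = 500           # assumption: avg encoded span size
RETENTION_DAYS = 7         # NFR4

spans_per_sec = TRACES_PER_SEC * SPANS_PER_TRACE
ingest_mb_per_sec = spans_per_sec * SPAN_BYTES / 1e6
retained_tb = ingest_mb_per_sec * 86_400 * RETENTION_DAYS / 1e6

print(f"{spans_per_sec:,} spans/s, ~{ingest_mb_per_sec:.0f} MB/s ingest")
print(f"~{retained_tb:.0f} TB retained over {RETENTION_DAYS} days")
```

At these assumptions the system ingests about 500 MB/s and retains roughly 300 TB over 7 days before sampling, which motivates both the sampling module and a horizontally scalable store.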
Think Before You Design
Key Components
Instrumentation libraries for services
Trace context propagation mechanism
Trace collector or agent
Trace storage database
Indexing and query engine
Visualization UI/dashboard
Sampling and rate limiting module
Design Patterns
Context propagation pattern
Event logging and span creation
Sampling patterns (head-based, tail-based)
Data aggregation and indexing
Asynchronous data ingestion
Caching for query performance
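The context propagation pattern above is commonly implemented with the W3C `traceparent` header (`version-trace_id-parent_id-flags`). The sketch below is illustrative only; real services would use an OpenTelemetry propagator rather than hand-rolling this.

```python
# Minimal sketch of W3C traceparent-style context propagation.
import secrets

def start_trace() -> str:
    """Mint a new traceparent header at the edge of the system."""
    trace_id = secrets.token_hex(16)  # 128-bit trace ID
    span_id = secrets.token_hex(8)    # 64-bit span ID
    return f"00-{trace_id}-{span_id}-01"

def propagate(incoming: str) -> str:
    """Keep the trace ID; mint a new span ID for the downstream call."""
    version, trace_id, _parent, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

header = start_trace()
child = propagate(header)
assert child.split("-")[1] == header.split("-")[1]  # same trace ID
```

The key property is that the trace ID is preserved across every hop while each hop contributes a fresh span ID, which is what lets the backend stitch spans back into one trace.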
Reference Architecture
Client --> Service A --> Service B --> Service C
  |          |           |           |
  |          |           |           +--> Trace Collector --> Trace Storage
  |          |           +--> Trace Collector
  |          +--> Trace Collector
  +--> Trace Collector

Trace Storage <--> Query Engine <--> Visualization UI
Components
Instrumentation Libraries
OpenTelemetry SDKs
Automatically capture trace spans and propagate trace context in each service
Trace Collector
gRPC/HTTP Collector Service
Receive trace data from services, batch and forward to storage
Trace Storage
Distributed NoSQL DB (e.g., Cassandra, Elasticsearch)
Store trace spans and indexes for efficient retrieval
Query Engine
Search and indexing system (e.g., Elasticsearch)
Support fast queries by trace ID, service, or time range
Visualization UI
Web Dashboard (React or similar)
Display trace timelines and service call graphs for developers
Sampling Module
In-process or Collector-based
Control volume of trace data collected to reduce overhead
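The collector's core job, batching spans before writing to storage, can be sketched as below. `store` is a stand-in for a Cassandra or Elasticsearch bulk write; a production collector would also flush on a timer and bound memory.

```python
# Sketch of a batching Trace Collector: spans are buffered and flushed
# to storage in batches, decoupling ingest from storage writes.
from typing import Callable

class BatchingCollector:
    def __init__(self, store: Callable[[list], None], batch_size: int = 100):
        self.store = store
        self.batch_size = batch_size
        self.buffer: list = []

    def receive(self, span: dict) -> None:
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.store(self.buffer)
            self.buffer = []

batches = []
collector = BatchingCollector(batches.append, batch_size=2)
collector.receive({"span_id": "a"})
collector.receive({"span_id": "b"})  # second span triggers a flush
```

Batching amortizes per-write overhead on the storage tier, which matters at the span rates estimated above.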
Request Flow
1. Client sends a request to Service A with trace context headers.
2. Service A's instrumentation creates a span and propagates the trace context to Service B.
3. Service B creates its own span linked to the trace and calls Service C the same way.
4. Each service sends its collected spans asynchronously to the Trace Collector.
5. The Trace Collector batches spans and stores them in Trace Storage.
6. The Query Engine indexes trace data for fast retrieval.
7. A developer queries traces via the Visualization UI using a trace ID or filters.
8. The UI displays the full trace timeline and service call graph.
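The flow above, condensed: each hop creates a span that records its parent, so the collected spans reconstruct the call tree. The names here are illustrative, not a real instrumentation API.

```python
# Three services in a call chain, each emitting one span.
import time, uuid

spans = []  # stands in for spans shipped to the Trace Collector

def handle(service, trace_id, parent_id, call_next=None):
    span_id = uuid.uuid4().hex[:16]
    start = time.time()
    if call_next:
        call_next(trace_id, span_id)  # propagate trace_id + new parent
    spans.append({"trace_id": trace_id, "span_id": span_id,
                  "parent_id": parent_id, "service": service,
                  "start": start, "end": time.time()})

trace_id = uuid.uuid4().hex
handle("A", trace_id, None,
       lambda t, p: handle("B", t, p,
       lambda t, p: handle("C", t, p)))
# spans now holds one span per service, linked by parent_id
```

Note that spans are emitted asynchronously and may arrive at the collector out of order (here C's span is recorded first); the tree is recovered from `trace_id` and `parent_id`, not arrival order.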
Database Schema
Entities:
- Trace: unique trace ID, start time, end time
- Span: span ID, trace ID (foreign key), parent span ID, service name, operation name, start time, end time, tags/metadata

Relationships:
- One Trace has many Spans (1:N)
- Spans are linked by parent span ID, forming a tree that represents the call hierarchy
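The schema can be modeled as below, together with the parent-span-ID grouping a query engine would perform when rendering a trace. Field names follow the entities above; the code itself is a sketch, not a storage-layer API.

```python
# Span entity plus call-tree reconstruction from parent_span_id.
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    span_id: str
    trace_id: str
    parent_span_id: Optional[str]
    service_name: str
    operation_name: str
    start_time: float
    end_time: float
    tags: dict = field(default_factory=dict)

def build_tree(spans):
    """Group spans by parent_span_id to recover the call hierarchy."""
    children = defaultdict(list)
    for s in spans:
        children[s.parent_span_id].append(s)
    return children  # children[None] holds the root span(s)

root = Span("a1", "t1", None, "api", "GET /orders", 0.0, 1.0)
leaf = Span("b2", "t1", "a1", "db", "SELECT", 0.1, 0.4)
tree = build_tree([root, leaf])
```

In a wide-column store like Cassandra, `trace_id` would typically be the partition key so all spans of a trace land on one partition and can be read in a single query.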
Scaling Discussion
Bottlenecks
High volume of trace data causing storage and query slowdowns
Trace Collector becoming a bottleneck under heavy load
Network overhead from trace context propagation
Query latency increasing with data size
Sampling misconfiguration leading to too much or too little data
Solutions
Use scalable distributed storage like Cassandra or Elasticsearch clusters
Deploy multiple Trace Collector instances with load balancing
Optimize instrumentation to propagate minimal trace context data
Implement indexing strategies and caching in Query Engine
Apply adaptive sampling strategies to balance data volume and coverage
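One common head-based sampling approach hashes the trace ID into a bucket, so every service independently reaches the same keep/drop decision without coordination. The sketch below assumes hex trace IDs and an example 1% rate; adaptive sampling would adjust `rate` at runtime from observed traffic.

```python
# Head-based probabilistic sampling, deterministic on the trace ID.
import hashlib

def should_sample(trace_id: str, rate: float = 0.01) -> bool:
    # Hash the trace ID into [0, 1) and compare against the rate.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

kept = sum(should_sample(f"{i:032x}", rate=0.1) for i in range(10_000))
# roughly 10% of traces are kept, regardless of which service decides
```

The trade-off versus tail-based sampling: this decision is made before the trace completes, so it cannot preferentially keep slow or failed traces, but it requires no buffering in the collector.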
Interview Tips
Time: Spend 10 minutes clarifying requirements and constraints, 20 minutes designing components and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Explain trace context propagation clearly with examples
Discuss trade-offs of sampling strategies
Justify technology choices for storage and query
Highlight how to minimize performance impact on services
Describe how visualization helps developers debug distributed systems