
Distributed tracing (Jaeger, Zipkin) in Microservices - System Design Exercise

Design: Distributed Tracing System for Microservices
Design the tracing collection, storage, and visualization system. Instrumentation libraries and microservice code changes are out of scope.
Functional Requirements
FR1: Trace requests as they flow through multiple microservices
FR2: Collect timing and metadata for each service call
FR3: Visualize traces to identify latency and errors
FR4: Support high throughput with minimal overhead
FR5: Allow querying traces by trace ID, service, or time range
FR6: Integrate with existing microservices without major code changes
Non-Functional Requirements
NFR1: Handle up to 100,000 traces per second
NFR2: End-to-end trace latency under 500ms for visualization
NFR3: 99.9% system availability
NFR4: Minimal impact on microservice performance (less than 5% overhead)
NFR5: Data retention for 7 days
Think Before You Design
Questions to Ask
❓ Must every request be traced, or is sampling acceptable? If so, what sampling rate?
❓ What is the expected request volume and number of services? (NFR1 targets up to 100,000 traces per second.)
❓ Which protocols do services use to communicate (HTTP, gRPC, message queues)?
❓ How long must trace data be retained, and is archival needed beyond that window? (NFR5 specifies 7 days.)
❓ What per-request overhead is tolerable for instrumentation? (NFR4 caps it at 5%.)
❓ Is there an existing observability stack (logging, metrics) the tracing system must integrate with?
Key Components
Instrumentation libraries for trace context and span creation
Trace collector agents or gateways
Storage backend (e.g., Elasticsearch, Cassandra)
Query API and UI for trace visualization
Sampling strategies to reduce data volume
Context propagation mechanisms (HTTP headers, gRPC metadata)
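Context propagation over HTTP headers can be sketched in a few lines. The snippet below is a minimal illustration (not a real instrumentation library) using the W3C `traceparent` header format, where `inject`, `extract`, and `new_trace_context` are hypothetical helper names:

```python
import secrets

def new_trace_context():
    """Start a new trace: 128-bit trace ID, 64-bit span ID (hex-encoded)."""
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}

def inject(ctx, headers):
    """Write the context into outgoing HTTP headers (W3C traceparent format)."""
    headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-01"
    return headers

def extract(headers):
    """Read the parent context from incoming headers, or start a new trace."""
    tp = headers.get("traceparent")
    if tp is None:
        return new_trace_context()
    _version, trace_id, parent_span_id, _flags = tp.split("-")
    return {"trace_id": trace_id, "span_id": parent_span_id}

# Service A starts a trace and calls Service B:
ctx_a = new_trace_context()
outgoing = inject(ctx_a, {})
# Service B receives the request and continues the same trace:
ctx_b = extract(outgoing)
```

Because the trace ID travels with every request, each service's spans can later be stitched into one trace.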
Design Patterns
Context propagation pattern
Sampling pattern (head-based, tail-based)
Fan-out and aggregation of trace data
Asynchronous data ingestion
Data retention and archival
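The head-based sampling pattern can be made deterministic by hashing the trace ID, so every service reaches the same keep/drop decision without coordination. A minimal sketch (class name and rate are illustrative, not from any specific library):

```python
class ProbabilisticSampler:
    """Head-based sampling: the decision is made once, when the trace starts,
    and is derived from the trace ID itself. Any service seeing the same
    trace ID makes the same decision, so traces are never half-recorded."""

    def __init__(self, rate):
        # rate is the fraction of traces to keep, e.g. 0.01 for 1%.
        self.threshold = int(rate * (1 << 64))

    def should_sample(self, trace_id_hex):
        # Treat the low 64 bits of the trace ID as a uniform random value.
        return int(trace_id_hex, 16) % (1 << 64) < self.threshold

sampler = ProbabilisticSampler(rate=1.0)   # keep everything (for testing)
decision = sampler.should_sample("4bf92f3577b34da6a3ce929d0e0e4736")
```

Tail-based sampling, by contrast, buffers complete traces and decides after the fact (e.g. keep all traces with errors), which requires a stateful collector tier.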
Reference Architecture
  +----------------+       +----------------+       +----------------+
  | Microservice A |-----> | Microservice B |-----> | Microservice C |
  +----------------+       +----------------+       +----------------+
          |                        |                        |
          |                        |                        |
          v                        v                        v
  +---------------------------------------------------------------+
  |                      Instrumentation Libraries                |
  +---------------------------------------------------------------+
                                |
                                v
                    +---------------------------+
                    |    Trace Collector(s)     |
                    |   (Jaeger/Zipkin Agent)   |
                    +---------------------------+
                                 |
                                 v
                    +---------------------------+
                    |      Storage Backend      |
                    | (Elasticsearch/Cassandra) |
                    +---------------------------+
                                |
                                v
                      +--------------------+
                      | Query API & UI     |
                      | (Jaeger/Zipkin UI) |
                      +--------------------+
Components
Instrumentation Libraries
OpenTelemetry SDKs
Automatically create and propagate trace context and spans in microservices
Trace Collector
Jaeger Agent or Zipkin Collector
Receive trace data from services and forward to storage
Storage Backend
Elasticsearch or Cassandra
Store trace and span data for querying and retention
Query API and UI
Jaeger UI or Zipkin UI
Allow users to search, view, and analyze traces
Request Flow
1. A user request enters Microservice A, where the instrumentation library creates a new trace and root span.
2. Trace context is injected into outgoing requests to Microservice B.
3. Microservice B extracts the trace context and creates child spans for its operations.
4. This continues through Microservice C and beyond, each service adding spans.
5. Instrumentation libraries send spans asynchronously to the Trace Collector.
6. The Trace Collector batches and stores spans in the Storage Backend.
7. A user queries the Query API/UI to retrieve and visualize traces by trace ID or other filters.
8. The UI displays the trace timeline, showing latency and errors across services.
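The flow above can be sketched end to end. This is a toy simulation, not real instrumentation: `spans` stands in for the collector's ingest buffer, and the service/operation names are invented for illustration:

```python
import time
import uuid

spans = []  # stand-in for the Trace Collector's ingest buffer

def start_span(trace_id, parent_id, service, operation):
    """Open a span; in a real library this is done by instrumentation."""
    return {"trace_id": trace_id, "span_id": uuid.uuid4().hex[:16],
            "parent_span_id": parent_id, "service": service,
            "operation": operation, "start": time.time()}

def finish_span(span):
    """Close a span and report it (a real system sends asynchronously)."""
    span["duration_ms"] = (time.time() - span["start"]) * 1000
    spans.append(span)

# A request enters Service A, which calls B, which calls C:
trace_id = uuid.uuid4().hex
root = start_span(trace_id, None, "service-a", "GET /checkout")
child_b = start_span(trace_id, root["span_id"], "service-b", "reserve-stock")
child_c = start_span(trace_id, child_b["span_id"], "service-c", "charge-card")
finish_span(child_c)
finish_span(child_b)
finish_span(root)
```

All three spans carry the same trace ID and are linked via parent span IDs, which is exactly what the UI needs to render the timeline.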
Database Schema
Entities:
- Trace: unique trace ID, start time, end time, status
- Span: span ID, parent span ID, trace ID, service name, operation name, start time, duration, tags, logs
Relationships:
- One Trace has many Spans (1:N)
- Spans are linked by parent span ID, forming a tree that represents the call hierarchy
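The parent-span-ID relationship means the call tree can be rebuilt at query time from flat span records. A minimal sketch (the span IDs and service names below are made up for the example):

```python
from collections import defaultdict

def build_call_tree(trace_spans):
    """Reconstruct the call hierarchy: the span with no parent is the root,
    and every other span is grouped under its parent span ID."""
    children = defaultdict(list)
    root = None
    for span in trace_spans:
        if span["parent_span_id"] is None:
            root = span
        else:
            children[span["parent_span_id"]].append(span)
    return root, children

trace = [
    {"span_id": "a1", "parent_span_id": None, "service": "api-gateway"},
    {"span_id": "b2", "parent_span_id": "a1", "service": "orders"},
    {"span_id": "c3", "parent_span_id": "b2", "service": "payments"},
]
root, children = build_call_tree(trace)
```

This is why the storage backend only needs to index flat span rows by trace ID; the tree structure is implicit in the parent links.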
Scaling Discussion
Bottlenecks
High volume of trace data causing storage overload
Trace Collector becoming a bottleneck under heavy load
Query latency increasing with large data size
Network overhead from trace data transmission
Instrumentation overhead impacting microservice performance
Solutions
Implement sampling strategies to reduce trace volume (e.g., probabilistic sampling)
Scale Trace Collectors horizontally with load balancing
Use scalable storage solutions optimized for time-series and search (Elasticsearch clusters, Cassandra)
Compress trace data and batch transmissions to reduce network load
Optimize instrumentation to minimize synchronous calls and use asynchronous reporting
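The last two points, batching and asynchronous reporting, can be combined in one reporter: the hot path only enqueues, and a background thread ships batches. A simplified sketch assuming a caller-supplied `send` function (real libraries such as the OpenTelemetry SDKs provide production-grade batch processors):

```python
import queue
import threading

class BatchReporter:
    """Spans are enqueued on the request path (non-blocking) and flushed to
    the collector in batches by a background thread, so per-request
    instrumentation overhead stays minimal."""

    def __init__(self, send, batch_size=100, flush_interval=1.0):
        self.send = send                  # ships a list of spans to the collector
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.q = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, span):
        self.q.put_nowait(span)           # microservice code returns immediately

    def _run(self):
        batch = []
        while not self._stop.is_set() or not self.q.empty():
            try:
                batch.append(self.q.get(timeout=self.flush_interval))
            except queue.Empty:
                pass
            if batch and (len(batch) >= self.batch_size or self.q.empty()):
                self.send(batch)          # ship a full batch, or drain on idle
                batch = []
        if batch:
            self.send(batch)              # flush any remainder on shutdown

    def close(self):
        self._stop.set()
        self._worker.join()

sent = []
reporter = BatchReporter(sent.extend, batch_size=2, flush_interval=0.05)
for i in range(5):
    reporter.report({"span_id": i})
reporter.close()
```

Dropping spans under backpressure (bounded queue) rather than blocking the request path is the usual trade-off here: tracing data is observability data, and losing some is preferable to slowing user traffic.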
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing components and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Explain how trace context propagates across microservices
Discuss trade-offs of sampling vs full tracing
Describe storage choices and query patterns
Highlight how visualization helps identify latency and errors
Address performance impact and how to minimize it
Show awareness of scaling challenges and solutions