| Users/Traffic | Metrics | Logs | Traces |
|---|---|---|---|
| 100 users | Basic CPU, memory, and request counts collected on a few services | Logs stored locally as plain text files and inspected manually | Traces sampled at a low rate; few services instrumented |
| 10K users | Centralized metrics collection with Prometheus or similar; alerting added | Logs shipped to central system (e.g., ELK stack); indexing starts | Distributed tracing enabled on key services; sampling rate increased |
| 1M users | High-cardinality metrics; long-term storage; aggregation and downsampling | Log volume grows; retention policies and archiving needed; indexing optimized | Traces collected for most requests; storage and query performance optimized |
| 100M users | Metrics sharded and federated; multi-tenant isolation; advanced anomaly detection | Logs stored in scalable object storage; cold and hot storage tiers; AI-based log analysis | Traces sampled intelligently; trace data linked with metrics and logs for root cause |
Three pillars (metrics, logs, traces) in Microservices - Scalability & System Analysis
At small scale, logs stored locally become hard to manage and search as volume grows.
At medium scale, centralized logging systems face storage and indexing bottlenecks due to high log volume.
At large scale, trace data storage and query performance degrade because traces are large and complex.
Overall, the first bottleneck is usually the logging infrastructure because logs grow fastest and require heavy indexing.
- Metrics: Use aggregation, downsampling, and sharding; employ time-series databases optimized for high cardinality.
- Logs: Implement centralized log management with scalable storage (e.g., Elasticsearch clusters, cloud object storage); apply log retention and archiving policies; use indexing and compression.
- Traces: Use sampling strategies to reduce volume; store traces in specialized databases; correlate traces with metrics and logs for efficient debugging.
- General: Use horizontal scaling for collectors and storage; apply caching and tiered storage; automate alerting and anomaly detection.
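As one illustration of the downsampling mentioned for metrics, here is a minimal sketch that averages raw samples into fixed-width time buckets. The function name `downsample` and the sample data are assumptions made for this example; a real time-series database would do this internally and usually keeps min, max, and count alongside the average.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    # One averaged point per bucket, ordered by time.
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical CPU samples taken every 10 seconds, reduced to 60-second resolution.
raw = [(0, 10.0), (10, 20.0), (20, 30.0), (60, 40.0), (70, 60.0)]
downsampled = downsample(raw, 60)
print(downsampled)
```

Five raw points become two stored points here, which is the trade-off downsampling makes: less storage and faster queries in exchange for losing per-sample detail in older data.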
Assuming 1M users generating 10 requests/sec each:
- Total requests: 10 million/sec
- Metrics: 1-10 million data points/sec; requires high-throughput TSDB (e.g., Prometheus, Cortex)
- Logs: Each request generates ~1KB logs -> ~10GB/sec raw logs; needs compression and tiered storage
- Traces: Sampling 1% -> 100K traces/sec; each trace ~10KB -> ~1GB/sec storage
- Network: High bandwidth needed for shipping logs and traces; consider local aggregation
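The estimate above can be reproduced with a few lines of arithmetic. This is a sketch using only the figures stated in the list (1 KB taken as 1,000 bytes); the function name `estimate` and the daily log total are additions derived from those figures.

```python
USERS = 1_000_000
REQUESTS_PER_USER_PER_SEC = 10
LOG_BYTES_PER_REQUEST = 1_000   # ~1 KB of logs per request
TRACE_SAMPLE_PERCENT = 1        # 1% of requests are traced
TRACE_BYTES = 10_000            # ~10 KB per trace
SECONDS_PER_DAY = 86_400


def estimate(users=USERS):
    """Return per-second volumes for requests, logs, and traces."""
    requests_per_sec = users * REQUESTS_PER_USER_PER_SEC
    log_bytes_per_sec = requests_per_sec * LOG_BYTES_PER_REQUEST
    traces_per_sec = requests_per_sec * TRACE_SAMPLE_PERCENT // 100
    trace_bytes_per_sec = traces_per_sec * TRACE_BYTES
    return {
        "requests_per_sec": requests_per_sec,
        "log_gb_per_sec": log_bytes_per_sec / 1e9,
        "traces_per_sec": traces_per_sec,
        "trace_gb_per_sec": trace_bytes_per_sec / 1e9,
        # Derived figure, not in the list above: raw log volume per day.
        "log_tb_per_day": log_bytes_per_sec * SECONDS_PER_DAY / 1e12,
    }


result = estimate()
for name, value in result.items():
    print(f"{name}: {value:,}")
```

At these rates, uncompressed logs alone would reach roughly 864 TB per day, which is why the list calls for compression and tiered storage.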
Structure your scalability discussion by:
- Explaining the role of each pillar (metrics, logs, traces) in observability.
- Describing how data volume grows with users and requests.
- Identifying bottlenecks in storage, indexing, and query performance.
- Suggesting concrete scaling solutions like sampling, sharding, and tiered storage.
- Discussing trade-offs between data fidelity and cost.
Your database handles 1000 QPS for logs. Traffic grows 10x to 10,000 QPS. What do you do first?
Answer: Implement log sampling or filtering to reduce volume, then scale the logging database horizontally with sharding or add replicas to handle increased write load.
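The sampling step in this answer could be sketched with Python's standard `logging` module. The class name `SamplingFilter`, the 10% rate, and the logger name `checkout` are assumptions for illustration; the idea is to keep every warning and error while dropping most lower-severity records before they reach the logging database.

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Keep all WARNING+ records; keep a random fraction of lower-level ones."""

    def __init__(self, sample_rate, rng=None):
        super().__init__()
        self.sample_rate = sample_rate       # fraction in [0.0, 1.0]
        self.rng = rng or random.Random()    # injectable for testing

    def filter(self, record):
        # Never drop warnings or errors: they matter most for debugging.
        if record.levelno >= logging.WARNING:
            return True
        # random() returns a value in [0.0, 1.0), so rate 1.0 keeps everything.
        return self.rng.random() < self.sample_rate


# Attach the filter to a handler so sampling happens before logs are shipped.
logger = logging.getLogger("checkout")  # hypothetical service name
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(sample_rate=0.1))
logger.addHandler(handler)
```

Sampling reduces write load immediately and buys time for the slower work of sharding or adding replicas, at the cost of losing some low-severity detail.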
Practice
metrics in microservices monitoring?
Solution
Step 1: Understand what metrics represent
Metrics are numerical measurements like CPU usage, request counts, or latency that show system health over time.
Step 2: Differentiate metrics from logs and traces
Logs record events, traces follow request paths, but metrics summarize performance data.
Final Answer:
They provide numerical data about system performance over time. -> Option C
Quick Check:
Metrics = numerical performance data [OK]
- Confusing metrics with logs as event records
- Thinking traces are numerical data
- Assuming metrics store configurations
Solution
Step 1: Identify standard log formats
JSON format is widely used for structured logs in microservices for easy parsing and querying.
Step 2: Compare options for correctness
{"timestamp": "2024-06-01T12:00:00Z", "level": "ERROR", "message": "Failed to connect"} is a valid JSON log entry with timestamp, level, and message fields. Others are less structured or not JSON.
Final Answer:
{"timestamp": "2024-06-01T12:00:00Z", "level": "ERROR", "message": "Failed to connect"} -> Option B
Quick Check:
Structured JSON logs = {"timestamp": "2024-06-01T12:00:00Z", "level": "ERROR", "message": "Failed to connect"} [OK]
- Using unstructured plain text logs
- Confusing XML-like logs with JSON
- Ignoring timestamp or level fields
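A log entry in this shape can be produced with a custom formatter. The following is a minimal sketch using only the standard library; the class name `JsonFormatter` and the logger name `payments` are assumptions, and production setups often add fields such as service name or trace ID.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        # record.created is a Unix timestamp; render it as UTC in ISO 8601 form.
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


# Format one record directly to show the resulting line.
record = logging.LogRecord(
    "payments", logging.ERROR, "app.py", 0, "Failed to connect", None, None
)
line = JsonFormatter().format(record)
print(line)
```

Because every line is valid JSON with fixed field names, a central log system can index `level` and `timestamp` without custom parsing rules.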
{
"traceId": "abc123",
"spans": [
{"service": "A", "duration_ms": 50},
{"service": "B", "duration_ms": 30},
{"service": "C", "duration_ms": 20}
]
}
Solution
Step 1: Understand trace spans and durations
Each span shows the time spent in one service. The total time is the sum of the span durations if the services run sequentially.
Step 2: Sum durations of all spans
50 ms + 30 ms + 20 ms = 100 ms total processing time.
Final Answer:
100 ms -> Option A
Quick Check:
Sum of span durations = 100 ms [OK]
- Taking only the longest span as total time
- Ignoring some spans in calculation
- Confusing traceId with duration
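The calculation in this solution can be checked by parsing the trace and summing its spans. This sketch assumes, as the solution does, that the three services run one after another; if spans overlapped, start and end timestamps would be needed instead of a plain sum.

```python
import json

# The trace from the question, as a JSON string.
trace_json = """
{
  "traceId": "abc123",
  "spans": [
    {"service": "A", "duration_ms": 50},
    {"service": "B", "duration_ms": 30},
    {"service": "C", "duration_ms": 20}
  ]
}
"""

trace = json.loads(trace_json)

# Sequential spans: total processing time is the sum of all durations.
total_ms = sum(span["duration_ms"] for span in trace["spans"])

# The slowest span is a natural starting point for optimization.
slowest = max(trace["spans"], key=lambda span: span["duration_ms"])

print(f"total: {total_ms} ms, slowest service: {slowest['service']}")
```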
Solution
Step 1: Understand trace ID propagation
Trace IDs must be passed along service calls to link logs and traces.
Step 2: Identify cause of missing trace IDs
If trace context is not propagated, logs won't have trace IDs, breaking trace-log correlation.
Final Answer:
Trace context is not propagated between services. -> Option D
Quick Check:
Missing trace IDs = missing context propagation [OK]
- Confusing metrics with trace IDs
- Assuming storage location causes missing IDs
- Blaming programming language differences
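To make the propagation idea concrete, here is a minimal sketch using only the Python standard library. The header name `X-Trace-Id` and all function and class names are hypothetical choices for this example; in practice a tracing library such as OpenTelemetry typically handles extraction and injection, often using the W3C `traceparent` header.

```python
import contextvars
import logging
import secrets

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch


def extract_trace_context(headers):
    """On an incoming request: reuse the caller's trace ID or start a new one."""
    trace_id = headers.get(TRACE_HEADER) or secrets.token_hex(16)
    current_trace_id.set(trace_id)
    return trace_id


def inject_trace_context(headers):
    """Before an outgoing request: copy the current trace ID into the headers."""
    trace_id = current_trace_id.get()
    if trace_id is not None:
        headers[TRACE_HEADER] = trace_id
    return headers


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


# Service B receives a call from service A, then calls service C.
extract_trace_context({TRACE_HEADER: "abc123"})
outgoing_headers = inject_trace_context({})
print(outgoing_headers)
```

If any service in the chain skips the inject step, downstream services start fresh trace IDs and their logs can no longer be joined to the original trace, which is the failure described in this solution.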
Solution
Step 1: Identify best practices for scalable monitoring
Centralized systems like Prometheus for metrics, ELK for logs, and OpenTelemetry for traces are industry standards for scalability and analysis.
Step 2: Evaluate options for scalability and effectiveness
Local storage limits analysis and scalability; ignoring logs/traces loses insights; sending raw data to clients is inefficient and insecure.
Final Answer:
Use a centralized monitoring system that collects metrics via Prometheus, logs via ELK stack, and traces via OpenTelemetry. -> Option A
Quick Check:
Centralized, specialized tools = scalable monitoring [OK]
- Storing logs/traces locally only
- Ignoring logs or traces
- Sending raw data directly to clients
