
Logging strategies in HLD - System Design Exercise

Design: Logging System for Distributed Applications
The design covers log collection, storage, search, and alerting. It excludes detailed UI design and log analysis algorithms.
Functional Requirements
FR1: Collect logs from multiple services and servers
FR2: Support different log levels (info, warning, error, debug)
FR3: Allow searching and filtering logs by time, service, and level
FR4: Ensure logs are stored reliably and durably
FR5: Provide real-time monitoring and alerting on critical errors
FR6: Support high write throughput (up to 100,000 logs per second)
FR7: Allow log retention policies and archiving
Non-Functional Requirements
NFR1: System must handle 100K log entries per second
NFR2: Search queries should return results within 2 seconds
NFR3: System availability must be 99.9%
NFR4: Logs must be stored for at least 30 days before archiving
NFR5: Latency for log ingestion should be under 500ms
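The throughput and retention targets above imply a rough storage budget. The following back-of-envelope sketch works it out, assuming an average entry size of 500 bytes. That size is an illustrative assumption, not a figure from the requirements, and replication and index overhead are ignored.

```python
# Back-of-envelope capacity estimate for the stated ingest rate.
LOGS_PER_SECOND = 100_000   # from FR6 / NFR1
RETENTION_DAYS = 30         # from NFR4
AVG_ENTRY_BYTES = 500       # ASSUMPTION: not given in the requirements
SECONDS_PER_DAY = 24 * 60 * 60


def ingest_bytes_per_second(rate=LOGS_PER_SECOND, size=AVG_ENTRY_BYTES):
    """Raw ingest bandwidth before compression or replication."""
    return rate * size


def raw_storage_bytes(days=RETENTION_DAYS):
    """Raw bytes accumulated over the retention window."""
    return ingest_bytes_per_second() * SECONDS_PER_DAY * days


print(f"ingest:  {ingest_bytes_per_second() / 1e6:.0f} MB/s")
print(f"per day: {raw_storage_bytes(days=1) / 1e12:.2f} TB")
print(f"30 days: {raw_storage_bytes() / 1e12:.1f} TB")
```

Under that assumption the system ingests about 50 MB/s, or roughly 4.32 TB per day and 129.6 TB over 30 days, which is one motivation for tiered storage.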
Think Before You Design
Key Components
Log collectors/agents on servers
Message queue or streaming platform for log transport
Centralized log storage (e.g., Elasticsearch, cloud storage)
Indexing and search engine
Alerting and monitoring system
Log archiving and retention manager
Design Patterns
Log aggregation pattern
Event streaming with backpressure handling
Tiered storage for hot and cold logs
Circuit breaker for log ingestion failures
Structured logging and log enrichment
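As an illustration of the circuit breaker pattern listed above, here is a minimal sketch for an agent that ships batches downstream. The class and function names, the thresholds, and the idea of spilling to a local buffer are all assumptions made for this example, not part of the reference design.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures, allows a trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the reset timeout has elapsed.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def ship(batch, send, breaker, spill):
    """Send a batch; on an open circuit or a failure, keep it in a local spill buffer."""
    if not breaker.allow():
        spill.extend(batch)
        return False
    try:
        send(batch)
    except Exception:
        breaker.record_failure()
        spill.extend(batch)
        return False
    breaker.record_success()
    return True
```

While the circuit is open, the agent stops hammering the unavailable queue and keeps logs locally, so they can be replayed once a trial send succeeds.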
Reference Architecture
  +-------------+       +----------------+       +-------------------+       +-----------------+
  | Application | ----> | Log Collector  | ----> | Message Queue     | ----> | Log Storage     |
  | Services    |       | (Agent)        |       | (Kafka/RabbitMQ)  |       | (Elasticsearch) |
  +-------------+       +----------------+       +-------------------+       +-----------------+
                                                                                      |
                                                                                      v
                                                                             +-----------------+
                                                                             | Alerting &      |
                                                                             | Monitoring      |
                                                                             +-----------------+
Components
Log Collector (Agent)
Technology: Fluentd, Logstash, or a custom agent
Role: Collects logs from applications and forwards them reliably
Message Queue
Technology: Apache Kafka or RabbitMQ
Role: Buffers logs and transports them asynchronously to storage
Log Storage
Technology: Elasticsearch or a similar search engine
Role: Stores and indexes logs and supports fast search
Alerting & Monitoring
Technology: Prometheus + Alertmanager (alerting on metrics derived from logs) or a custom system
Role: Monitors logs for critical errors and sends alerts
Retention Manager
Technology: Custom scripts or lifecycle policies
Role: Manages log retention and archives older logs to cheaper storage
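The retention manager's decision can be sketched as a small function. The 30-day hot window comes from NFR4. The 365-day deletion horizon and the action names are assumptions chosen for illustration only.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 30       # from NFR4: keep at least 30 days before archiving
ARCHIVE_DAYS = 365  # ASSUMPTION: deletion horizon is not specified in the source


def retention_action(log_time, now):
    """Return what the retention manager should do with a log of this age."""
    age = now - log_time
    if age < timedelta(days=HOT_DAYS):
        return "keep_hot"
    if age < timedelta(days=ARCHIVE_DAYS):
        return "archive"
    return "delete"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(retention_action(now - timedelta(days=45), now))
```

In practice this logic usually runs per index or per time bucket rather than per entry, since daily indices can be moved or dropped as a unit.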
Request Flow
1. Application services generate logs in a structured format with levels.
2. Log collectors running on servers capture logs and batch them.
3. Logs are sent asynchronously to a message queue to handle bursts.
4. Consumers read logs from the queue and index them into Elasticsearch.
5. Users query Elasticsearch to search and filter logs by criteria.
6. The alerting system monitors logs for error patterns and triggers notifications.
7. The retention manager deletes or archives logs older than the retention period.
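Steps 1 to 3 of the flow above can be sketched as a structured entry plus a batching collector. The names make_entry and BatchingCollector, the batch size, and the wait time are hypothetical. The publish callable stands in for a real queue producer.

```python
import json
import time


def make_entry(service, level, message, **metadata):
    """Build a structured log entry (step 1)."""
    return {
        "timestamp": time.time(),
        "service_name": service,
        "log_level": level,
        "message": message,
        "metadata": metadata,
    }


class BatchingCollector:
    """Buffers serialized entries and publishes them in batches (steps 2 and 3)."""

    def __init__(self, publish, max_batch=500, max_wait_s=1.0, clock=time.monotonic):
        self.publish = publish      # stand-in for a queue producer's send call
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.clock = clock
        self.buffer = []
        self.first_at = 0.0

    def add(self, entry):
        if not self.buffer:
            self.first_at = self.clock()
        self.buffer.append(json.dumps(entry))
        full = len(self.buffer) >= self.max_batch
        # The age check only runs when a new entry arrives; a real agent
        # would also flush from a background timer.
        stale = self.clock() - self.first_at >= self.max_wait_s
        if full or stale:
            self.flush()

    def flush(self):
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.publish(batch)
```

Batching trades a small amount of ingestion latency for far fewer network round trips, so max_wait_s should stay well below the 500 ms ingestion target in NFR5 if that target is measured end to end.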
Database Schema
Entities:
LogEntry(id, timestamp, service_name, log_level, message, metadata_json)
Service(id, name, owner)
Alert(id, condition, severity, status)
Relationships:
LogEntry is linked to Service by service_name
Alerts are configured on LogEntry patterns
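To make the LogEntry entity and the FR3 filters concrete, the sketch below models the entity as a dataclass and filters an in-memory list by time, service, and level. The search function is a hypothetical stand-in for a real Elasticsearch query, and the choice of an inclusive start with an exclusive end is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class LogEntry:
    id: str
    timestamp: datetime
    service_name: str
    log_level: str
    message: str
    metadata: dict = field(default_factory=dict)  # metadata_json, already parsed


def search(entries, start=None, end=None, service=None, level=None):
    """Filter by time range [start, end), service, and level; None means no filter."""
    return [
        e for e in entries
        if (start is None or e.timestamp >= start)
        and (end is None or e.timestamp < end)
        and (service is None or e.service_name == service)
        and (level is None or e.log_level == level)
    ]


t0 = datetime(2024, 1, 1, 12, 0, 0)
logs = [
    LogEntry("1", t0, "auth", "error", "login failed"),
    LogEntry("2", t0 + timedelta(minutes=5), "auth", "info", "login ok"),
]
print([e.id for e in search(logs, service="auth", level="error")])
```

Every query carrying a time range is what lets the storage layer touch only the relevant time-based indices instead of scanning all data.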
Scaling Discussion
Bottlenecks
Message queue throughput limits under high log volume
Storage capacity and indexing speed in Elasticsearch
Search query latency with large datasets
Alerting system overload with many error events
Solutions
Partition message queue topics and scale consumers horizontally
Use tiered storage: hot nodes for recent logs, cold storage for older logs
Implement query caching and limit query scope for faster responses
Use sampling and rate limiting in alerting to reduce noise
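Rate limiting in the alerting path can be as simple as a per-key cooldown. The sketch below is one possible approach. The AlertThrottle name, the (service, pattern) key, and the 5-minute cooldown are illustrative assumptions.

```python
import time
from collections import defaultdict


class AlertThrottle:
    """Allows at most one notification per (service, pattern) per cooldown window."""

    def __init__(self, cooldown_s=300.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.last_sent = {}
        self.suppressed = defaultdict(int)  # count of muted events per key

    def should_notify(self, service, pattern):
        key = (service, pattern)
        now = self.clock()
        last = self.last_sent.get(key)
        if last is not None and now - last < self.cooldown_s:
            self.suppressed[key] += 1
            return False
        self.last_sent[key] = now
        return True
```

Tracking the suppressed count lets the next notification report how many similar events were muted, so reducing noise does not hide the scale of an incident.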
Interview Tips
Time (45 minutes total): Spend 10 minutes clarifying requirements and constraints, 20 minutes designing the architecture and data flow, 10 minutes discussing scaling and trade-offs, and 5 minutes summarizing.
Emphasize reliability and durability of log collection
Discuss asynchronous log transport to handle bursts
Explain indexing and search for fast log retrieval
Highlight alerting importance for operational monitoring
Address scaling challenges and mitigation strategies