Logging strategies in HLD - System Design Exercise

Design: Logging System for Distributed Applications

The design covers log collection, storage, search, and alerting. Detailed UI design and log analysis algorithms are out of scope.

Functional Requirements

FR1: Collect logs from multiple services and servers

FR2: Support different log levels (info, warning, error, debug)

FR3: Allow searching and filtering logs by time, service, and level

FR4: Ensure logs are stored reliably and durably

FR5: Provide real-time monitoring and alerting on critical errors

FR6: Support high write throughput (up to 100,000 logs per second)

FR7: Allow log retention policies and archiving

Non-Functional Requirements

NFR1: System must handle 100K log entries per second

NFR2: Search queries should return results within 2 seconds

NFR3: System availability must be 99.9%

NFR4: Logs must be stored for at least 30 days before archiving

NFR5: Latency for log ingestion should be under 500ms
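
A quick back-of-envelope estimate helps size the system against these numbers. The sketch below assumes an average log entry of 500 bytes; that figure is illustrative and not part of the requirements, so substitute a measured value. The result excludes replication and index overhead.

```python
# Back-of-envelope sizing for the stated requirements.
LOGS_PER_SECOND = 100_000   # from NFR1
RETENTION_DAYS = 30         # from NFR4
AVG_ENTRY_BYTES = 500       # ASSUMPTION: not specified in the requirements

SECONDS_PER_DAY = 24 * 60 * 60


def ingest_bytes_per_second() -> int:
    """Raw bytes arriving per second at peak write rate."""
    return LOGS_PER_SECOND * AVG_ENTRY_BYTES


def entries_per_day() -> int:
    """Number of log entries written per day at a sustained peak rate."""
    return LOGS_PER_SECOND * SECONDS_PER_DAY


def raw_storage_bytes(days: int = RETENTION_DAYS) -> int:
    """Raw bytes held in searchable storage for the retention window."""
    return entries_per_day() * AVG_ENTRY_BYTES * days


print(f"Ingest rate: {ingest_bytes_per_second() / 1e6:.0f} MB/s")
print(f"Entries per day: {entries_per_day() / 1e9:.2f} billion")
print(f"Raw storage for {RETENTION_DAYS} days: {raw_storage_bytes() / 1e12:.1f} TB")
```

Under the 500-byte assumption this works out to roughly 50 MB/s of ingest and about 130 TB of raw data over 30 days, which is why tiered storage and horizontal scaling appear later in the design.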

Key Components

Log collectors/agents on servers

Message queue or streaming platform for log transport

Centralized log storage (e.g., Elasticsearch, cloud storage)

Indexing and search engine

Alerting and monitoring system

Log archiving and retention manager

Design Patterns

Log aggregation pattern

Event streaming with backpressure handling

Tiered storage for hot and cold logs

Circuit breaker for log ingestion failures

Structured logging and log enrichment
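
As an illustration of the structured logging and enrichment pattern, here is a minimal sketch using Python's standard `logging` module. The class `JsonFormatter`, the helper `build_logger`, the field names, and the `context` attribute are hypothetical choices for this example, not a prescribed format.

```python
import json
import logging
import socket
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name
        self.host = socket.gethostname()  # enrichment: added to every entry

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            "host": self.host,
            "message": record.getMessage(),
        }
        # Optional per-call context passed via extra={"context": {...}}.
        context = getattr(record, "context", None)
        if context:
            entry["context"] = context
        return json.dumps(entry)


def build_logger(service_name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service_name))
        logger.addHandler(handler)
    return logger


logger = build_logger("checkout")
logger.error("payment failed", extra={"context": {"order_id": "A-123"}})
```

Emitting one JSON object per line lets the collector forward entries without parsing free-form text, and the fixed `service` and `level` fields map directly onto the search filters in FR3.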

Reference Architecture

  +-------------+       +----------------+       +-------------------+       +-----------------+
  | Application | ----> | Log Collector  | ----> | Message Queue     | ----> | Log Storage     |
  | Services    |       | (Agent)        |       | (Kafka/RabbitMQ)  |       | (Elasticsearch) |
  +-------------+       +----------------+       +-------------------+       +-----------------+
                                                                                      |
                                                                                      v
                                                                             +-----------------+
                                                                             | Alerting &      |
                                                                             | Monitoring      |
                                                                             +-----------------+

Components

Log Collector (Agent)

Fluentd, Logstash, or custom agent

Collect logs from applications and forward them reliably
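
A minimal sketch of how an agent might batch entries and retain them when forwarding fails. `BatchingCollector` and its parameters are hypothetical and do not reflect the Fluentd or Logstash APIs; the `send` callable stands in for whatever transport the agent uses.

```python
import time
from typing import Callable, List, Optional


class BatchingCollector:
    """Buffer log lines and forward them in batches by size or age."""

    def __init__(
        self,
        send: Callable[[List[str]], None],
        max_batch: int = 100,
        max_age_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._send = send
        self._max_batch = max_batch
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer: List[str] = []
        self._oldest: Optional[float] = None  # arrival time of oldest buffered line

    def collect(self, line: str) -> None:
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(line)
        too_big = len(self._buffer) >= self._max_batch
        too_old = self._clock() - self._oldest >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> bool:
        """Send the buffered batch; return False if the send failed."""
        if not self._buffer:
            return True
        try:
            self._send(list(self._buffer))
        except Exception:
            # Keep the batch so a later flush can retry; nothing is dropped.
            return False
        self._buffer.clear()
        self._oldest = None
        return True

    def pending(self) -> int:
        return len(self._buffer)


sent: List[List[str]] = []
collector = BatchingCollector(sent.append, max_batch=2, max_age_seconds=60)
collector.collect("line 1")
collector.collect("line 2")  # reaches max_batch, so the batch is forwarded
print(sent)
```

This sketch keeps failed batches in memory without a bound; a production agent would typically cap the buffer or spill to disk so that a long outage downstream does not exhaust memory on the host.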

Message Queue

Apache Kafka or RabbitMQ

Buffer and transport logs asynchronously to storage
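
To show the buffering role in isolation, the sketch below uses Python's in-process `queue.Queue` as a stand-in for a topic. `LogBuffer` is a hypothetical name and this is not a Kafka or RabbitMQ client; it only illustrates how a bounded buffer absorbs bursts and signals backpressure to producers when full.

```python
import queue
from typing import List


class LogBuffer:
    """Bounded in-process buffer standing in for a message queue topic."""

    def __init__(self, capacity: int):
        self._q: "queue.Queue[str]" = queue.Queue(maxsize=capacity)
        self.rejected = 0  # entries refused because the buffer was full

    def publish(self, entry: str, timeout: float = 0.05) -> bool:
        """Try to enqueue; return False if the buffer stays full (backpressure)."""
        try:
            self._q.put(entry, timeout=timeout)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def drain(self, max_items: int) -> List[str]:
        """Consumer side: take up to max_items entries without blocking."""
        items: List[str] = []
        while len(items) < max_items:
            try:
                items.append(self._q.get_nowait())
            except queue.Empty:
                break
        return items


buffer = LogBuffer(capacity=2)
print(buffer.publish("a"), buffer.publish("b"), buffer.publish("c", timeout=0.01))
print(buffer.drain(10))
```

When `publish` returns False the producer has to choose between retrying, buffering locally, or dropping low-priority entries such as debug logs; that choice is a design decision the source leaves open.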

Log Storage

Elasticsearch or similar search engine

Store, index, and allow fast search of logs

Alerting & Monitoring

Prometheus + Alertmanager (fed by metrics derived from logs, such as error counts per service) or a custom system

Monitor logs for critical errors and send alerts
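
One simple form of error-pattern alerting is a sliding-window count. The sketch below is a hypothetical custom rule, not Prometheus or Alertmanager configuration; `ErrorRateAlert`, the threshold, and the window length are illustrative, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque


class ErrorRateAlert:
    """Fire when at least `threshold` errors occur within `window_seconds`."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window = window_seconds
        self._times: Deque[float] = deque()  # timestamps of recent errors

    def observe(self, level: str, timestamp: float) -> bool:
        """Record one log entry; return True if the alert condition holds."""
        if level != "error":
            return False
        self._times.append(timestamp)
        # Drop errors that have fallen out of the window.
        cutoff = timestamp - self.window
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) >= self.threshold


rule = ErrorRateAlert(threshold=3, window_seconds=60)
for ts, level in [(0, "error"), (1, "info"), (10, "error"), (20, "error")]:
    if rule.observe(level, ts):
        print(f"ALERT at t={ts}: 3 or more errors in the last 60 seconds")
```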

Retention Manager

Custom scripts or lifecycle policies

Manage log retention and archiving to cheaper storage
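
The retention decision can be sketched as a pure function of entry age. The 30-day hot period comes from NFR4; the 365-day archive period and the name `retention_action` are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 30       # from NFR4: keep searchable for at least 30 days
ARCHIVE_DAYS = 365  # ASSUMPTION: how long archived logs are kept


def retention_action(entry_time: datetime, now: datetime) -> str:
    """Return 'keep', 'archive', or 'delete' based on the entry's age."""
    age = now - entry_time
    if age < timedelta(days=HOT_DAYS):
        return "keep"      # stays in primary, searchable storage
    if age < timedelta(days=ARCHIVE_DAYS):
        return "archive"   # move to cheaper storage
    return "delete"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
for days_old in (5, 45, 400):
    print(days_old, retention_action(now - timedelta(days=days_old), now))
```

In practice this decision is usually applied to whole time-based indices or files rather than to individual entries, which keeps deletion and archiving cheap.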

Request Flow

1. Application services generate logs in a structured format with levels.

2. Log collectors running on servers capture logs and batch them.

3. Logs are sent asynchronously to a message queue to handle bursts.

4. Consumers read logs from the queue and index them into Elasticsearch.

5. Users query Elasticsearch to search and filter logs by criteria.

6. The alerting system monitors logs for error patterns and triggers notifications.

7. The retention manager deletes or archives logs older than the retention period.
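
The steps above can be traced with an in-memory sketch covering batching, transport, indexing, and search. Everything here is a stand-in: a `queue.Queue` replaces the message queue, a plain list replaces Elasticsearch, and the names `run_pipeline` and `search` and the dictionary fields are hypothetical.

```python
import queue
from typing import Any, Dict, List, Optional

LogDoc = Dict[str, Any]


def run_pipeline(entries: List[LogDoc], batch_size: int = 2) -> List[LogDoc]:
    """Steps 2-4: batch entries, pass them through a queue, and index them."""
    transport: "queue.Queue[List[LogDoc]]" = queue.Queue()
    # Steps 2-3: the collector batches entries and publishes each batch.
    for i in range(0, len(entries), batch_size):
        transport.put(entries[i:i + batch_size])
    # Step 4: a consumer drains the queue and appends to the index.
    index: List[LogDoc] = []
    while not transport.empty():
        index.extend(transport.get())
    return index


def search(
    index: List[LogDoc],
    start: int,
    end: int,
    service: Optional[str] = None,
    level: Optional[str] = None,
) -> List[LogDoc]:
    """Step 5: filter by time range [start, end), service, and level."""
    hits = [
        doc for doc in index
        if start <= doc["timestamp"] < end
        and (service is None or doc["service"] == service)
        and (level is None or doc["level"] == level)
    ]
    return sorted(hits, key=lambda doc: doc["timestamp"])


entries = [
    {"timestamp": 10, "service": "api", "level": "info", "message": "started"},
    {"timestamp": 20, "service": "api", "level": "error", "message": "timeout"},
    {"timestamp": 30, "service": "db", "level": "error", "message": "deadlock"},
]
index = run_pipeline(entries)
print(search(index, start=0, end=100, service="api", level="error"))
```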

Database Schema

Entities:

LogEntry(id, timestamp, service_name, log_level, message, metadata_json)

Service(id, name, owner)

Alert(id, condition, severity, status)

Relationships:

LogEntry is linked to Service by service_name

Alerts are configured on LogEntry patterns
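
The entities can be sketched as Python dataclasses. The field names follow the schema above; the field types, the default for `metadata_json`, and the `metadata` helper are assumptions, since the schema does not specify them.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class Service:
    id: int
    name: str
    owner: str


@dataclass
class LogEntry:
    id: str
    timestamp: datetime
    service_name: str          # links to Service.name
    log_level: str             # info, warning, error, or debug
    message: str
    metadata_json: str = "{}"  # free-form context stored as a JSON string

    def metadata(self) -> Dict[str, Any]:
        """Decode the JSON metadata into a dictionary."""
        return json.loads(self.metadata_json)


@dataclass
class Alert:
    id: int
    condition: str  # pattern evaluated against LogEntry fields
    severity: str
    status: str


entry = LogEntry(
    id="e-1",
    timestamp=datetime(2024, 6, 1, tzinfo=timezone.utc),
    service_name="checkout",
    log_level="error",
    message="payment failed",
    metadata_json=json.dumps({"order_id": "A-123"}),
)
print(entry.metadata()["order_id"])
```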

Scaling Discussion

Bottlenecks

Message queue throughput limits under high log volume

Storage capacity and indexing speed in Elasticsearch

Search query latency with large datasets

Alerting system overload with many error events

Solutions

Partition message queue topics and scale consumers horizontally

Use tiered storage: hot nodes for recent logs, cold storage for older logs

Cache frequent queries and limit query scope (for example, require a time range) for faster responses

Use sampling and rate limiting in alerting to reduce noise
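
Partitioning depends on a stable mapping from a key to a partition so that consumers can scale horizontally. The sketch below hashes the service name with SHA-256; this is an illustrative choice and not the partitioner any particular message queue uses by default, and `partition_for` is a hypothetical helper.

```python
import hashlib


def partition_for(service_name: str, num_partitions: int) -> int:
    """Map a service name to a partition in a stable, process-independent way."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(service_name.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo N.
    return int.from_bytes(digest[:8], "big") % num_partitions


for name in ("checkout", "search", "auth"):
    print(name, partition_for(name, num_partitions=12))
```

Keying by service name preserves per-service ordering, but a single very chatty service can overload one partition; adding a second component to the key (for example, the host) spreads that load at the cost of ordering across hosts.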

Interview Tips

Time: Spend 10 minutes clarifying requirements and constraints, 20 minutes designing architecture and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.

Emphasize reliability and durability of log collection

Discuss asynchronous log transport to handle bursts

Explain indexing and search for fast log retrieval

Highlight alerting importance for operational monitoring

Address scaling challenges and mitigation strategies