
Dead letter queues in HLD - System Design Exercise

Design: Dead Letter Queue System
This design focuses on the dead letter queue (DLQ) mechanism and its integration with a message queue system. The detailed implementation of the primary message-processing logic and the monitoring UI are out of scope.
Functional Requirements
FR1: Capture messages that cannot be processed successfully after multiple retries
FR2: Store failed messages separately for later inspection or reprocessing
FR3: Support configurable retry limits before moving messages to dead letter queue
FR4: Provide monitoring and alerting for dead letter queue size and growth
FR5: Allow manual or automated reprocessing of messages from dead letter queue
FR6: Ensure message order is preserved where applicable
FR7: Integrate with existing message queue systems (e.g., RabbitMQ, Kafka, AWS SQS)
Non-Functional Requirements
NFR1: Handle up to 100,000 messages per minute
NFR2: Retry attempts must not exceed 5 per message
NFR3: Dead letter queue must be highly available with 99.9% uptime
NFR4: Latency for normal message processing should remain under 200ms
NFR5: System must support message retention in dead letter queue for at least 7 days
Think Before You Design
Questions to Ask
❓ What error types count as a processing failure, and should transient and permanent failures be retried the same way?
❓ How many retries are allowed per message, and what backoff strategy applies between attempts?
❓ Must message ordering be preserved, and if so, globally or only within a partition or group?
❓ Which message queue technology are we integrating with (RabbitMQ, Kafka, AWS SQS)?
❓ How long must dead-lettered messages be retained, and what happens when retention expires?
❓ Should reprocessing be manual, automated, or both, and who is authorized to trigger it?
❓ What DLQ size or growth rate should trigger an alert?
Key Components
Primary message queue system
Retry mechanism with counters
Dead letter queue storage
Monitoring and alerting service
Reprocessing service or API
Message metadata tracking
Design Patterns
Poison message handling
Retry with exponential backoff
Circuit breaker pattern for failing consumers
Event sourcing for message state tracking
Idempotent message processing
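The retry-with-exponential-backoff pattern above can be sketched in a few lines. This is a minimal illustration, not tied to any particular queue client; the `base` and `cap` parameters are assumptions, and full jitter is used to avoid retry storms from many consumers failing at once:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay before retry `attempt` (1-based): exponential growth with
    full jitter, capped so a long retry chain never waits unboundedly."""
    return random.uniform(0, min(cap, base * (2 ** (attempt - 1))))

# Per NFR2, at most 5 retry attempts per message.
delays = [backoff_delay(a) for a in range(1, 6)]
```

The uncapped upper bounds here would be 0.5s, 1s, 2s, 4s, 8s; jitter spreads actual delays uniformly below those bounds.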
Reference Architecture
Client
  |
  v
Primary Message Queue ---> Message Consumer ---> Processing Logic
                               |
                               |-- On failure and retries exhausted --> Dead Letter Queue Storage
                               |
                               |-- Monitoring & Alerting Service

Dead Letter Queue Storage <--> Reprocessing Service/API

Components
Primary Message Queue
RabbitMQ / Kafka / AWS SQS
Handles normal message delivery and processing
Message Consumer
Microservice or Worker
Consumes messages, processes them, and tracks retry attempts
Retry Mechanism
In-memory or persistent counters with exponential backoff
Retries message processing up to the configured limit before declaring failure
Dead Letter Queue Storage
Durable queue or database (e.g., Kafka topic, SQS DLQ, or NoSQL DB)
Stores failed messages for later inspection or reprocessing
Monitoring and Alerting Service
Prometheus + Grafana or CloudWatch
Tracks DLQ size, growth rate, and triggers alerts
Reprocessing Service/API
Microservice with API
Allows manual or automated reprocessing of dead letter messages
Request Flow
1. Client sends message to Primary Message Queue
2. Message Consumer receives message and attempts processing
3. If processing fails, retry counter increments and message is retried with backoff
4. After max retries (5), message is moved to Dead Letter Queue Storage
5. Monitoring service observes DLQ metrics and alerts if thresholds exceeded
6. Operators or automated jobs can trigger Reprocessing Service to reprocess DLQ messages
7. Successfully reprocessed messages are removed from Dead Letter Queue
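The consumer-side portion of this flow (steps 2-4) can be sketched as follows. This is a simplified model under stated assumptions: `process`, `primary_queue`, and `dlq` are hypothetical stand-ins for a real handler and queue clients, and messages are plain dicts:

```python
MAX_RETRIES = 5  # per NFR2

def handle(message, process, primary_queue, dlq):
    """Attempt processing; on failure, increment the retry counter and
    requeue, or move the message to the DLQ once retries are exhausted."""
    try:
        process(message["payload"])
        return "processed"
    except Exception as exc:
        message["retry_count"] = message.get("retry_count", 0) + 1
        if message["retry_count"] >= MAX_RETRIES:
            message["failure_reason"] = repr(exc)
            dlq.append(message)        # step 4: retries exhausted -> DLQ
            return "dead_letter"
        primary_queue.append(message)  # step 3: requeue for another attempt
        return "retried"
```

In a real deployment the retry counter would live in message metadata (e.g. a Kafka header or an SQS attribute) so it survives redelivery across consumer instances.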
Database Schema
Entities:
- Message: id (PK), payload, metadata, retry_count, status (pending, processing, failed, dead_letter)
- DeadLetterMessage: id (PK), original_message_id (FK), payload, failure_reason, timestamp, retry_count
Relationships:
- One-to-one from DeadLetterMessage to Message for traceability
- Retry count tracked per Message entity
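The two entities can be sketched as dataclasses. Field names mirror the schema above; the concrete types and defaults are illustrative assumptions, not part of the original design:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Message:
    id: str
    payload: bytes
    metadata: dict
    retry_count: int = 0
    status: str = "pending"  # pending | processing | failed | dead_letter

@dataclass
class DeadLetterMessage:
    id: str
    original_message_id: str  # FK -> Message.id, for traceability
    payload: bytes
    failure_reason: str
    timestamp: datetime = field(default_factory=datetime.utcnow)
    retry_count: int = 0
```

Keeping a copy of the payload on DeadLetterMessage (rather than only the FK) lets reprocessing proceed even if the original message has been purged from the primary store.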
Scaling Discussion
Bottlenecks
High volume of failed messages causing DLQ storage to grow rapidly
Message Consumer overwhelmed by retry attempts causing processing delays
Monitoring system unable to keep up with DLQ metrics at scale
Reprocessing service bottlenecked by large DLQ size
Solutions
Implement DLQ partitioning and archiving to manage storage size
Use distributed consumers with rate limiting and backoff to handle retries efficiently
Scale monitoring infrastructure horizontally and use sampling for metrics
Batch reprocessing with parallel workers and prioritize messages by failure reason
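The batch-reprocessing solution above can be sketched with a thread pool. This is a minimal in-memory model: `process` is a hypothetical handler, and the batch size and worker count are illustrative; messages that still fail are returned so they remain in the DLQ:

```python
from concurrent.futures import ThreadPoolExecutor

def reprocess_batch(dlq_messages, process, workers=4, batch_size=100):
    """Take one batch from the DLQ, fan it out across parallel workers,
    and return the messages that still fail plus the untouched remainder."""
    batch, remaining = dlq_messages[:batch_size], dlq_messages[batch_size:]

    def attempt(msg):
        try:
            process(msg)
            return None   # success: drop from the DLQ
        except Exception:
            return msg    # still failing: keep for a later pass

    with ThreadPoolExecutor(max_workers=workers) as pool:
        still_failing = [m for m in pool.map(attempt, batch) if m is not None]
    return still_failing + remaining
```

Sorting or bucketing `dlq_messages` by failure_reason before calling this function implements the prioritization mentioned above, since the first batch is always taken from the front of the list.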
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying retry and failure policies, 15 minutes designing the architecture and data flow, 10 minutes discussing scaling and bottlenecks, 10 minutes for Q&A and trade-offs.
Explain the importance of handling poison messages to avoid system blockage
Describe retry strategies and how to track retry counts per message
Justify choice of durable storage for dead letter queue to ensure reliability
Discuss monitoring and alerting to proactively detect issues
Highlight reprocessing capabilities for operational flexibility
Address scalability challenges and solutions clearly