
Dead letter queues in HLD - System Design Exercise

Design: Dead Letter Queue System
This design focuses on the dead letter queue (DLQ) mechanism and its integration with a message queue system. The detailed implementation of the primary message-processing logic and the monitoring UI are out of scope.
Functional Requirements
FR1: Capture messages that cannot be processed successfully after multiple retries
FR2: Store failed messages separately for later inspection or reprocessing
FR3: Support configurable retry limits before moving messages to dead letter queue
FR4: Provide monitoring and alerting for dead letter queue size and growth
FR5: Allow manual or automated reprocessing of messages from dead letter queue
FR6: Ensure message order is preserved where applicable
FR7: Integrate with existing message queue systems (e.g., RabbitMQ, Kafka, AWS SQS)
Non-Functional Requirements
NFR1: Handle up to 100,000 messages per minute
NFR2: Retry attempts must not exceed 5 per message
NFR3: Dead letter queue must be highly available with 99.9% uptime
NFR4: Latency for normal message processing should remain under 200ms
NFR5: System must support message retention in dead letter queue for at least 7 days
Think Before You Design
Questions to Ask
❓ What error types count as a processing failure, and should transient and permanent failures be retried the same way?
❓ How many retries are allowed per message, and what backoff strategy applies between attempts?
❓ Must message ordering be preserved, and if so, globally or only within a partition or group?
❓ Which message queue technology are we integrating with (RabbitMQ, Kafka, AWS SQS)?
❓ How long must dead-lettered messages be retained, and what happens when retention expires?
❓ Should reprocessing be manual, automated, or both, and who is authorized to trigger it?
❓ What DLQ size or growth rate should trigger an alert?
Key Components
Primary message queue system
Retry mechanism with counters
Dead letter queue storage
Monitoring and alerting service
Reprocessing service or API
Message metadata tracking
Design Patterns
Poison message handling
Retry with exponential backoff
Circuit breaker pattern for failing consumers
Event sourcing for message state tracking
Idempotent message processing
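The retry-with-exponential-backoff pattern above can be sketched in a few lines. This is a minimal illustration, not tied to any particular queue client; the `base` and `cap` parameters are assumptions, and full jitter is used to avoid retry storms from many consumers failing at once:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay before retry `attempt` (1-based): exponential growth with
    full jitter, capped so a long retry chain never waits unboundedly."""
    return random.uniform(0, min(cap, base * (2 ** (attempt - 1))))

# Per NFR2, at most 5 retry attempts per message.
delays = [backoff_delay(a) for a in range(1, 6)]
```

The uncapped upper bounds here would be 0.5s, 1s, 2s, 4s, 8s; jitter spreads actual delays uniformly below those bounds.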
Reference Architecture
Client
  |
  v
Primary Message Queue ---> Message Consumer ---> Processing Logic
                               |
                               |-- On failure and retries exhausted --> Dead Letter Queue Storage
                               |
                               |-- Monitoring & Alerting Service

Dead Letter Queue Storage <--> Reprocessing Service/API

Components
Primary Message Queue
RabbitMQ / Kafka / AWS SQS
Handles normal message delivery and processing
Message Consumer
Microservice or Worker
Consumes messages, processes them, and tracks retry attempts
Retry Mechanism
In-memory or persistent counters with exponential backoff
Retries message processing up to the configured limit before declaring failure
Dead Letter Queue Storage
Durable queue or database (e.g., Kafka topic, SQS DLQ, or NoSQL DB)
Stores failed messages for later inspection or reprocessing
Monitoring and Alerting Service
Prometheus + Grafana or CloudWatch
Tracks DLQ size, growth rate, and triggers alerts
Reprocessing Service/API
Microservice with API
Allows manual or automated reprocessing of dead letter messages
Request Flow
1. Client sends message to Primary Message Queue
2. Message Consumer receives message and attempts processing
3. If processing fails, retry counter increments and message is retried with backoff
4. After max retries (5), message is moved to Dead Letter Queue Storage
5. Monitoring service observes DLQ metrics and alerts if thresholds exceeded
6. Operators or automated jobs can trigger Reprocessing Service to reprocess DLQ messages
7. Successfully reprocessed messages are removed from Dead Letter Queue
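The consumer-side portion of this flow (steps 2-4) can be sketched as follows. This is a simplified model under stated assumptions: `process`, `primary_queue`, and `dlq` are hypothetical stand-ins for a real handler and queue clients, and messages are plain dicts:

```python
MAX_RETRIES = 5  # per NFR2

def handle(message, process, primary_queue, dlq):
    """Attempt processing; on failure, increment the retry counter and
    requeue, or move the message to the DLQ once retries are exhausted."""
    try:
        process(message["payload"])
        return "processed"
    except Exception as exc:
        message["retry_count"] = message.get("retry_count", 0) + 1
        if message["retry_count"] >= MAX_RETRIES:
            message["failure_reason"] = repr(exc)
            dlq.append(message)        # step 4: retries exhausted -> DLQ
            return "dead_letter"
        primary_queue.append(message)  # step 3: requeue for another attempt
        return "retried"
```

In a real deployment the retry counter would live in message metadata (e.g. a Kafka header or an SQS attribute) so it survives redelivery across consumer instances.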
Database Schema
Entities:
- Message: id (PK), payload, metadata, retry_count, status (pending, processing, failed, dead_letter)
- DeadLetterMessage: id (PK), original_message_id (FK), payload, failure_reason, timestamp, retry_count
Relationships:
- One-to-one from DeadLetterMessage to Message for traceability
- Retry count tracked per Message entity
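The two entities can be sketched as dataclasses. Field names mirror the schema above; the concrete types and defaults are illustrative assumptions, not part of the original design:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Message:
    id: str
    payload: bytes
    metadata: dict
    retry_count: int = 0
    status: str = "pending"  # pending | processing | failed | dead_letter

@dataclass
class DeadLetterMessage:
    id: str
    original_message_id: str  # FK -> Message.id, for traceability
    payload: bytes
    failure_reason: str
    timestamp: datetime = field(default_factory=datetime.utcnow)
    retry_count: int = 0
```

Keeping a copy of the payload on DeadLetterMessage (rather than only the FK) lets reprocessing proceed even if the original message has been purged from the primary store.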
Scaling Discussion
Bottlenecks
High volume of failed messages causing DLQ storage to grow rapidly
Message Consumer overwhelmed by retry attempts causing processing delays
Monitoring system unable to keep up with DLQ metrics at scale
Reprocessing service bottlenecked by large DLQ size
Solutions
Implement DLQ partitioning and archiving to manage storage size
Use distributed consumers with rate limiting and backoff to handle retries efficiently
Scale monitoring infrastructure horizontally and use sampling for metrics
Batch reprocessing with parallel workers and prioritize messages by failure reason
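The batch-reprocessing solution above can be sketched with a thread pool. This is a minimal in-memory model: `process` is a hypothetical handler, and the batch size and worker count are illustrative; messages that still fail are returned so they remain in the DLQ:

```python
from concurrent.futures import ThreadPoolExecutor

def reprocess_batch(dlq_messages, process, workers=4, batch_size=100):
    """Take one batch from the DLQ, fan it out across parallel workers,
    and return the messages that still fail plus the untouched remainder."""
    batch, remaining = dlq_messages[:batch_size], dlq_messages[batch_size:]

    def attempt(msg):
        try:
            process(msg)
            return None   # success: drop from the DLQ
        except Exception:
            return msg    # still failing: keep for a later pass

    with ThreadPoolExecutor(max_workers=workers) as pool:
        still_failing = [m for m in pool.map(attempt, batch) if m is not None]
    return still_failing + remaining
```

Sorting or bucketing `dlq_messages` by failure_reason before calling this function implements the prioritization mentioned above, since the first batch is always taken from the front of the list.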
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying retry and failure policies, 15 minutes designing the architecture and data flow, 10 minutes discussing scaling and bottlenecks, 10 minutes for Q&A and trade-offs.
Explain the importance of handling poison messages to avoid system blockage
Describe retry strategies and how to track retry counts per message
Justify choice of durable storage for dead letter queue to ensure reliability
Discuss monitoring and alerting to proactively detect issues
Highlight reprocessing capabilities for operational flexibility
Address scalability challenges and solutions clearly