
Event-driven design in LLD - Scalability & System Analysis

Scalability Analysis - Event-driven design
Growth Table: Event-driven Design Scaling
| Users / Events | 100 users | 10K users | 1M users | 100M users |
|---|---|---|---|---|
| Event Volume | ~1K events/sec | ~100K events/sec | ~10M events/sec | ~1B events/sec |
| Event Broker Load | Single broker instance | Cluster of brokers | Multi-region broker clusters | Global distributed brokers with partitioning |
| Consumer Instances | Few consumers per service | Scaled consumers with load balancing | Auto-scaling consumers with partition assignment | Thousands of consumers with sharding and geo-distribution |
| Data Storage | Local or small DB | Partitioned DB or NoSQL | Sharded DB clusters or distributed storage | Multi-cloud distributed storage with archiving |
| Latency | Low (ms) | Low to moderate (ms to 10s ms) | Moderate (10s ms to 100s ms) | Higher latency due to geo-distribution (100s ms) |
First Bottleneck

The event broker (message queue) is the first bottleneck. A single broker instance can handle only a limited number of events per second (around 10K-100K), which is enough at small scale (~1K events/sec) but not at ~100K events/sec and beyond. As event volume grows, the broker's CPU, memory, and network bandwidth limits are reached first.

Scaling Solutions
  • Horizontal Scaling: Add more broker instances forming a cluster to distribute event load.
  • Partitioning: Split event streams into partitions so consumers can process in parallel.
  • Consumer Scaling: Increase number of consumer instances with load balancing and partition assignment.
  • Caching: Use caches for frequently accessed event data to reduce storage load.
  • Geo-distribution: Deploy brokers and consumers in multiple regions to reduce latency and increase availability.
  • Backpressure and Rate Limiting: Control event production rate to avoid overwhelming the system.
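
The partitioning and consumer-scaling points above can be sketched in a few lines. This is a minimal illustration, not any particular broker's algorithm: the function names `partition_for` and `assign_partitions`, the choice of SHA-256 as the hash, and the round-robin assignment are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map an event key to a partition using a stable hash.

    The same key always lands in the same partition, so events for one
    key keep their relative order inside that partition.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions across consumers round-robin (hypothetical policy)."""
    assignment: Dict[str, List[int]] = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Six partitions shared by three consumer instances.
print(assign_partitions(6, ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```

Adding a consumer instance only helps while there are more partitions than consumers, which is why partition count is usually chosen with future consumer scaling in mind.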
Back-of-Envelope Cost Analysis

For 10K users generating ~100K events/sec:

  • Broker cluster needs to handle 100K events/sec. At ~20-50K events/sec per node, that is roughly 2-5 nodes before adding headroom.
  • Consumers must scale to process 100K events/sec, possibly 10-20 instances depending on processing time.
  • Storage needs depend on event size; for 1KB events, 100K events/sec = ~100MB/sec = ~8.6TB/day.
  • Network bandwidth must support both event ingress and egress. A 1 Gbps link carries ~125MB/sec, so ~100MB/sec in plus at least ~100MB/sec out exceeds one link; multiple links or cloud bandwidth are needed.
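
The arithmetic above can be reproduced with a short script. It is a rough sketch under stated assumptions: decimal units (1KB = 1,000 bytes), no replication, and each event leaving the broker exactly once (a single consumer group). The constant and function names are made up for the example.

```python
import math

EVENTS_PER_SEC = 100_000
EVENT_SIZE_BYTES = 1_000                     # 1KB per event (decimal units)
LINK_BYTES_PER_SEC = 1_000_000_000 // 8      # 1 Gbps = 125 MB/sec


def brokers_needed(events_per_sec: int, per_broker_capacity: int) -> int:
    """Smallest number of brokers that covers the load, with no headroom."""
    return math.ceil(events_per_sec / per_broker_capacity)


throughput_bytes = EVENTS_PER_SEC * EVENT_SIZE_BYTES       # bytes/sec written
storage_tb_per_day = throughput_bytes * 86_400 / 1e12      # seconds per day = 86,400
# Assumption: every byte goes in once and out once.
links_needed = math.ceil(2 * throughput_bytes / LINK_BYTES_PER_SEC)

print(brokers_needed(EVENTS_PER_SEC, 50_000))   # 2 (best case per-node capacity)
print(brokers_needed(EVENTS_PER_SEC, 20_000))   # 5 (worst case per-node capacity)
print(f"{storage_tb_per_day:.2f} TB/day")       # 8.64 TB/day
print(links_needed)                             # 2
```

Replication or several independent consumer groups would multiply the storage and egress figures, so treat these numbers as lower bounds.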
Interview Tip

Structure your scalability discussion by first identifying the event volume growth, then pinpoint the bottleneck (usually the event broker). Next, explain how to scale horizontally with clusters and partitions, scale consumers, and manage data storage. Mention latency and geo-distribution considerations. Always justify why each step is needed based on system limits.

Self Check

Your event broker handles 1,000 events per second. Traffic grows 10x to 10,000 events per second. What do you do first?

Answer: Add more broker instances to form a cluster and partition the event streams to distribute load. This prevents the single broker from becoming a bottleneck and allows consumers to scale processing in parallel.

Key Result
Event-driven design scales by clustering and partitioning event brokers and scaling consumers horizontally. The first bottleneck is the event broker's capacity, fixed by adding broker nodes and partitions.

Practice

1. What is the main purpose of event-driven design in system architecture?
easy
A. To allow systems to react to actions as they happen asynchronously
B. To process all tasks sequentially in a fixed order
C. To store data permanently in a database
D. To create static web pages without user interaction

Solution

  1. Step 1: Understand event-driven design concept

    Event-driven design focuses on reacting to events or actions as they occur, rather than processing everything in a fixed sequence.
  2. Step 2: Compare options with concept

    Option A, "To allow systems to react to actions as they happen asynchronously", matches this idea. The other options describe unrelated concepts: sequential processing, data storage, and static content.
  3. Final Answer:

    To allow systems to react to actions as they happen asynchronously -> Option A
  4. Quick Check:

    Event-driven design = react asynchronously [OK]
Hint: Event-driven means reacting to events as they happen [OK]
Common Mistakes:
  • Confusing event-driven with sequential processing
  • Thinking event-driven is about data storage
  • Assuming event-driven means static content
2. Which of the following is the correct sequence in an event-driven system?
easy
A. Consumer -> Producer -> Queue
B. Producer -> Consumer -> Queue
C. Queue -> Producer -> Consumer
D. Producer -> Queue -> Consumer

Solution

  1. Step 1: Identify roles in event-driven flow

    Producers create events, queues hold events, and consumers process events.
  2. Step 2: Arrange correct order

    The producer sends an event to the queue, then the consumer reads it from the queue.
  3. Final Answer:

    Producer -> Queue -> Consumer -> Option D
  4. Quick Check:

    Producer creates, Queue holds, Consumer processes [OK]
Hint: Events flow: Producer to Queue to Consumer [OK]
Common Mistakes:
  • Mixing up producer and consumer order
  • Placing queue after consumer
  • Ignoring the queue role
3. Consider this simplified event-driven code snippet:
event_queue = []

def produce(event):
    event_queue.append(event)

def consume():
    if event_queue:
        return event_queue.pop(0)
    return None

produce('A')
produce('B')
print(consume())
print(consume())
print(consume())

What is the output?
medium
A. None None None
B. B A None
C. A B None
D. A None B

Solution

  1. Step 1: Trace event production

    Two events 'A' and 'B' are added to the queue in order: ['A', 'B'].
  2. Step 2: Trace event consumption

    consume() removes and returns the first event: first 'A', then 'B', then None when empty.
  3. Final Answer:

    A B None -> Option C
  4. Quick Check:

    FIFO queue returns A then B then None [OK]
Hint: Queue pops first-in event first (FIFO) [OK]
Common Mistakes:
  • Assuming LIFO instead of FIFO
  • Forgetting to check empty queue
  • Mixing order of events
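
As a side note on the snippet in this question: `list.pop(0)` is O(n) because every remaining element shifts left. A variant using `collections.deque`, whose `popleft()` is O(1), gives the same FIFO behavior. This is only an alternative sketch, not part of the original question.

```python
from collections import deque

event_queue = deque()


def produce(event):
    # Append to the right end of the deque.
    event_queue.append(event)


def consume():
    # Remove from the left end, so the oldest event comes out first.
    if event_queue:
        return event_queue.popleft()
    return None


produce('A')
produce('B')
results = [consume(), consume(), consume()]
print(results)  # ['A', 'B', None]
```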
4. In an event-driven system, a developer wrote this code snippet:
def consume(event_queue):
    event = event_queue.pop()
    process(event)

What is the main issue with this code?
medium
A. It does not check if the queue is empty before popping
B. It adds events instead of removing them
C. It uses an undefined function 'process'
D. It processes events in reverse order, not FIFO

Solution

  1. Step 1: Analyze pop usage without check

    pop() removes the last item, and the code never checks whether the queue is empty first.
  2. Step 2: Identify error risk

    Calling pop() on an empty list raises IndexError at runtime, so the code needs a safety check.
  3. Final Answer:

    It does not check if the queue is empty before popping -> Option A
  4. Quick Check:

    pop() on empty list causes error [OK]
Hint: Always check queue not empty before pop() [OK]
Common Mistakes:
  • Ignoring empty queue check
  • Confusing pop() order with error
  • Assuming the undefined process() is the main error
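
One possible fix for the snippet in this question is sketched below. It assumes the queue is a `collections.deque` and that the handler is passed in as a parameter named `process`; both choices are illustrative, not taken from the original code. Taking events from the left also keeps FIFO order.

```python
from collections import deque


def consume(event_queue, process):
    """Process the oldest event; return False if there was nothing to do."""
    if not event_queue:
        # Empty queue: skip instead of raising IndexError.
        return False
    event = event_queue.popleft()
    process(event)
    return True


# Usage: collect processed events in a list.
seen = []
pending = deque(['A', 'B'])
while consume(pending, seen.append):
    pass
print(seen)  # ['A', 'B']
```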
5. You are designing a scalable event-driven system for a social media app. Which approach best improves scalability and fault tolerance?
hard
A. Store all events in a database and process them synchronously
B. Use a distributed message queue with multiple consumers processing events in parallel
C. Use a single queue and one consumer to ensure event order
D. Send events directly from producer to consumer without queue

Solution

  1. Step 1: Understand scalability and fault tolerance needs

    Social media apps have high event volume; parallel processing and fault tolerance are key.
  2. Step 2: Evaluate options for scalability

    Distributed queues with multiple consumers allow load balancing and fault tolerance. Single consumer limits throughput. Synchronous processing blocks system. Direct send lacks buffering and fault tolerance.
  3. Final Answer:

    Use a distributed message queue with multiple consumers processing events in parallel -> Option B
  4. Quick Check:

    Distributed queues + parallel consumers = scalable & fault tolerant [OK]
Hint: Parallel consumers on distributed queue scale best [OK]
Common Mistakes:
  • Choosing a single consumer, which limits throughput
  • Ignoring the benefits of asynchronous processing
  • Skipping the queue, which can lead to lost events
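
The idea behind option B can be illustrated in a single process with threads. This is only a stand-in: a real distributed message queue runs on separate machines and adds persistence and replication, none of which is shown here. The function name `run_consumers`, the queue size, and the sentinel-based shutdown are assumptions for the sketch.

```python
import queue
import threading


def run_consumers(events, num_consumers=3, handler=lambda e: e):
    """Fan events out to several consumer threads and collect the results."""
    q = queue.Queue(maxsize=100)   # bounded: put() blocks when full (simple backpressure)
    results = []
    lock = threading.Lock()
    stop = object()                # sentinel telling a worker to exit

    def worker():
        while True:
            item = q.get()
            if item is stop:
                break
            out = handler(item)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_consumers)]
    for t in threads:
        t.start()
    for event in events:
        q.put(event)
    for _ in threads:
        q.put(stop)                # one sentinel per worker
    for t in threads:
        t.join()
    return results


doubled = run_consumers(range(10), handler=lambda e: e * 2)
print(sorted(doubled))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

The results are sorted before printing because the order in which parallel consumers finish is not guaranteed. That trade-off is why option C keeps a single consumer when strict ordering matters more than throughput.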