Emergency handling in LLD - Scalability & System Analysis

Scalability Analysis - Emergency handling
Growth Table: Emergency Handling System
| Users/Events | 100 Users | 10,000 Users | 1,000,000 Users | 100,000,000 Users |
| --- | --- | --- | --- | --- |
| Event Volume | ~10 events/min | ~1,000 events/min | ~100,000 events/min | ~10,000,000 events/min |
| System Components | Single server, simple alerting | Multiple servers, basic load balancing | Distributed servers, advanced routing | Global distributed system, multi-region failover |
| Database Load | Low, single instance | Moderate, read replicas | High, sharded database | Very high, multi-shard, geo-distributed DB |
| Alerting Latency | Seconds | Seconds to sub-second | Sub-second | Milliseconds |
| Storage Needs | GBs | TBs | Petabytes | Exabytes |
| Network Bandwidth | Low | Moderate | High | Very High |
First Bottleneck

The database is the first bottleneck as event volume grows. Emergency handling systems require fast writes and reads for alerts and logs. Average load at 10,000 users (about 1,000 events per minute) is still modest, but incident-driven bursts and alert-status reads on a single instance begin to add query latency there, and sustained write throughput becomes the limit as volume climbs toward the 1,000,000-user tier.

Scaling Solutions
  • Horizontal Scaling: Add more application servers behind load balancers to handle increased event processing.
  • Database Read Replicas: Use replicas to offload read queries and reduce latency.
  • Sharding: Partition the database by event type or region to distribute load.
  • Caching: Cache frequent queries and alert statuses in fast in-memory stores like Redis.
  • Message Queues: Use queues to buffer incoming events and smooth spikes in traffic.
  • CDN and Edge Computing: For alert delivery (e.g., notifications), use CDNs and edge nodes to reduce latency globally.
  • Multi-region Deployment: Deploy system components in multiple regions for fault tolerance and disaster recovery.
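The sharding bullet above can be sketched as a small routing function. This is a minimal sketch, assuming events are partitioned by region; the name `shard_for`, the shard count, and the region strings are hypothetical. It uses a stable hash from `hashlib` rather than Python's built-in `hash()`, which is randomized per process for strings and would route the same region differently on different servers.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, purely illustrative


def shard_for(region: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a region name to a shard index using a stable hash."""
    digest = hashlib.sha256(region.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for region in ("us-east", "eu-west", "ap-south"):
        print(region, "-> shard", shard_for(region))
```

Plain modulo hashing remaps most keys when the shard count changes; consistent hashing is a common alternative when shards are added or removed often.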
Back-of-Envelope Cost Analysis
  • At 10,000 users generating ~1,000 events/min (~17 events/sec), the system needs to handle ~17 writes/sec plus reads.
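  • Database write capacity: As a rough rule of thumb, assume a single PostgreSQL instance sustains on the order of 5,000 simple writes/sec (the real figure depends on hardware, schema, and indexing), so ~17 writes/sec leaves ample headroom at this scale.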
  • Storage: Assuming 1 KB per event, 1,000 events/min = ~1 MB/min = ~1.4 GB/day = ~43 GB/month.
  • Network bandwidth: 1,000 events/min * 1 KB = ~17 KB/sec, very low at this scale.
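  • At 1 million users (~100,000 events/min), write load averages ~1,667 QPS. Emergency traffic is bursty, so peaks can run several times the average and approach single-instance limits, which calls for a sharded DB and caching.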
  • Bandwidth and storage scale accordingly, requiring distributed storage and efficient data retention policies.
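The arithmetic in this list can be reproduced with a few lines of Python. This is a sketch under stated assumptions: "1 KB" is taken as 1,000 bytes, a month as 30 days, and only raw event payload is counted (no indexes or replication). The function name `estimate` is hypothetical.

```python
SECONDS_PER_MINUTE = 60
MINUTES_PER_MONTH = 60 * 24 * 30  # 30-day month, an assumption
EVENT_SIZE_BYTES = 1_000          # "1 KB per event", taken as 1,000 bytes


def estimate(events_per_min: int) -> dict:
    """Return rough write rate, bandwidth, and monthly storage for an event rate."""
    bytes_per_min = events_per_min * EVENT_SIZE_BYTES
    return {
        "writes_per_sec": events_per_min / SECONDS_PER_MINUTE,
        "kb_per_sec": bytes_per_min / SECONDS_PER_MINUTE / 1_000,
        "gb_per_month": bytes_per_min * MINUTES_PER_MONTH / 1_000_000_000,
    }


if __name__ == "__main__":
    # Event rates for the 10,000-user and 1,000,000-user tiers of the growth table.
    for users, rate in ((10_000, 1_000), (1_000_000, 100_000)):
        print(users, "users:", estimate(rate))
```

Changing `EVENT_SIZE_BYTES` or the retention window is the quickest way to see how sensitive the storage figure is to those assumptions.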
Interview Tip

Start by clarifying the expected event volume and latency requirements. Discuss the data flow from event ingestion to alerting. Identify the database as the likely bottleneck early. Propose incremental scaling steps: caching, read replicas, sharding, and multi-region deployment. Emphasize fault tolerance and disaster recovery in emergency systems.

Self Check

Your database handles 1,000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?

Answer: First check whether the extra load is mostly reads or writes. For read-heavy traffic, add read replicas and caching to offload the primary database. If writes are the bottleneck, shard the database to distribute write load across multiple instances.
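The caching part of the answer can be illustrated with a cache-aside lookup. This is a minimal in-process sketch, not a production cache: the `TTLCache` class, the `load_alert_status` function, and the 5-second TTL are all hypothetical, and a real deployment would more likely use an external store such as Redis.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process cache-aside helper; a stand-in for a store such as Redis."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get_or_load(self, key: str, load: Callable[[str], str]) -> str:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]              # fresh entry: no database read
        value = load(key)              # miss or expired: read from the database
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    db_reads = []

    def load_alert_status(alert_id: str) -> str:
        db_reads.append(alert_id)      # hypothetical database query
        return "ACTIVE"

    cache = TTLCache(ttl_seconds=5.0)
    cache.get_or_load("alert-1", load_alert_status)
    cache.get_or_load("alert-1", load_alert_status)
    print("database reads:", len(db_reads))
```

A short TTL is a deliberate choice here: alert status changes quickly in an emergency, so stale entries should expire within seconds.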

Key Result
Emergency handling systems first hit database bottlenecks as event volume grows; scaling requires caching, read replicas, sharding, and multi-region deployment to maintain low latency and high availability.

Practice

1. What is the primary goal of an emergency handling system in system design?
easy
A. To detect problems quickly and protect people and property
B. To increase system performance under normal conditions
C. To reduce the cost of hardware components
D. To provide detailed analytics for marketing purposes

Solution

  1. Step 1: Understand the purpose of emergency handling

    Emergency handling systems are designed to detect issues fast and act to prevent harm.
  2. Step 2: Identify the main goal

    The main goal is to protect people and property by quick detection and response.
  3. Final Answer:

    To detect problems quickly and protect people and property -> Option A
  4. Quick Check:

    Emergency handling = fast detection and protection [OK]
Hint: Focus on safety and speed in emergencies
Common Mistakes:
  • Confusing emergency handling with performance optimization
  • Thinking it is about cost reduction
  • Assuming it is for marketing analytics
2. Which component is NOT typically part of an emergency handling system?
easy
A. Safety action controller
B. Alerting system
C. Detection module
D. User interface for marketing

Solution

  1. Step 1: List typical components

    Emergency handling systems usually have detection, alerting, safety actions, and logging.
  2. Step 2: Identify the unrelated component

    User interface for marketing is unrelated to emergency handling functions.
  3. Final Answer:

    User interface for marketing -> Option D
  4. Quick Check:

    Marketing UI ≠ emergency handling component [OK]
Hint: Exclude marketing from emergency system parts
Common Mistakes:
  • Including unrelated business components
  • Confusing alerting with marketing notifications
  • Ignoring safety action controllers
3. Consider this simplified emergency system flow:
if sensor.detect(): alert.send(); safety.activate(); log.record()
What happens if sensor.detect() returns false?
medium
A. Alert, safety, and log actions all execute
B. Only alert and safety actions execute
C. No actions execute
D. Only log action executes

Solution

  1. Step 1: Analyze the if condition

    The actions alert.send(), safety.activate(), and log.record() run only if sensor.detect() is true.
  2. Step 2: Determine behavior when sensor.detect() is false

    If sensor.detect() returns false, the code block inside if does not run, so no actions execute.
  3. Final Answer:

    No actions execute -> Option C
  4. Quick Check:

    False detection = no emergency actions [OK]
Hint: If condition false means skip all inside actions
Common Mistakes:
  • Assuming log always runs regardless of detection
  • Thinking alert or safety run without detection
  • Confusing else behavior when none is given
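The behavior in question 3 can be checked with a runnable stand-in. This is a sketch: the `run_flow` function and the list of action names replace the question's `sensor`, `alert`, `safety`, and `log` objects, which are not defined in the source.

```python
def run_flow(detected: bool) -> list:
    """Run the one-line flow from the question and return the actions that fired."""
    actions = []
    # All three statements share the line with `if`, so all three belong to its body.
    if detected: actions.append("alert"); actions.append("safety"); actions.append("log")
    return actions


if __name__ == "__main__":
    print(run_flow(True))   # ['alert', 'safety', 'log']
    print(run_flow(False))  # []
```

In Python, semicolon-separated statements that follow the colon on the same line all form the body of the `if`, which is why nothing runs when detection is false.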
4. In an emergency system, this code snippet causes a problem:
if sensor.detect():
    alert.send()
    safety.activate()
log.record()

What is the main issue?
medium
A. Missing indentation causes log.record() to run always
B. safety.activate() is outside the if block
C. alert.send() is not called properly
D. log.record() runs even if no detection

Solution

  1. Step 1: Check code indentation

    log.record() is not indented under the if, so it runs always.
  2. Step 2: Understand impact

    log.record() runs even when sensor.detect() is false, which is incorrect behavior.
  3. Final Answer:

    Missing indentation causes log.record() to run always -> Option A
  4. Quick Check:

    Indentation controls conditional execution [OK]
Hint: Indent all emergency actions inside detection check
Common Mistakes:
  • Ignoring indentation importance
  • Assuming all lines are inside if by default
  • Confusing which lines run conditionally
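The indentation issue in question 4 can be seen by running both layouts side by side. This sketch assumes the intended snippet has `log.record()` at the same indentation level as the `if`, as the solution describes; the functions `buggy` and `fixed` and the action names are hypothetical stand-ins for the question's objects.

```python
def buggy(detected: bool) -> list:
    """Layout from the question: the log call sits outside the if block."""
    actions = []
    if detected:
        actions.append("alert")
        actions.append("safety")
    actions.append("log")      # dedented: runs on every call
    return actions


def fixed(detected: bool) -> list:
    """Corrected layout: the log call is indented under the if."""
    actions = []
    if detected:
        actions.append("alert")
        actions.append("safety")
        actions.append("log")  # indented: runs only on detection
    return actions


if __name__ == "__main__":
    print("buggy, no detection:", buggy(False))
    print("fixed, no detection:", fixed(False))
```

Both versions behave identically when detection is true; the difference only shows when it is false, which is why this kind of bug can slip past a happy-path test.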
5. You design an emergency system that must alert multiple teams and log events reliably. Which design approach best ensures alerts are sent even if one alert service fails?
hard
A. Send alerts sequentially and stop on first failure
B. Send alerts in parallel with retries and fallback logging
C. Send alerts only to the primary team to reduce complexity
D. Log events only after all alerts succeed

Solution

  1. Step 1: Understand reliability needs

    To ensure alerts reach multiple teams, sending in parallel avoids blocking on one failure.
  2. Step 2: Use retries and fallback logging

    Retries help recover from temporary failures; fallback logging records failures for later review.
  3. Final Answer:

    Send alerts in parallel with retries and fallback logging -> Option B
  4. Quick Check:

    Parallel + retries = reliable alerting [OK]
Hint: Use parallel alerts with retries for reliability
Common Mistakes:
  • Stopping alerts on first failure
  • Ignoring retries and fallback mechanisms
  • Reducing alert recipients to simplify
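Option B in question 5 can be sketched with a thread pool from the standard library. This is a minimal sketch under stated assumptions: the function names `send_with_retry` and `alert_all`, the team names, and the three-attempt limit are hypothetical, and each team's channel is modeled as a plain callable. A real system would likely add backoff between retries and a durable fallback store instead of a log line.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

logger = logging.getLogger("alerts")


def send_with_retry(team: str, send: Callable[[str], None], attempts: int = 3) -> bool:
    """Try one team's channel up to `attempts` times; log and give up on failure."""
    for attempt in range(1, attempts + 1):
        try:
            send(team)
            return True
        except Exception as exc:  # broad on purpose: any channel error triggers a retry
            logger.warning("alert to %s failed (attempt %d): %s", team, attempt, exc)
    # Fallback logging: record the undelivered alert for later review.
    logger.error("alert to %s undelivered after %d attempts", team, attempts)
    return False


def alert_all(senders: Dict[str, Callable[[str], None]]) -> Dict[str, bool]:
    """Send to every team in parallel so one failing channel cannot block the others."""
    if not senders:
        return {}
    with ThreadPoolExecutor(max_workers=len(senders)) as pool:
        futures = {team: pool.submit(send_with_retry, team, send)
                   for team, send in senders.items()}
        return {team: future.result() for team, future in futures.items()}


if __name__ == "__main__":
    def ok(team: str) -> None:
        print("delivered to", team)

    def down(team: str) -> None:
        raise ConnectionError("service unavailable")

    print(alert_all({"fire": ok, "medical": down, "security": ok}))
```

The per-team result map lets the caller see which alerts were delivered, so logging the event itself does not have to wait for every channel to succeed.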