| Metric | 100 Users | 10,000 Users | 1,000,000 Users | 100,000,000 Users |
|---|---|---|---|---|
| Event Volume | ~10 events/min | ~1,000 events/min | ~100,000 events/min | ~10,000,000 events/min |
| System Components | Single server, simple alerting | Multiple servers, basic load balancing | Distributed servers, advanced routing | Global distributed system, multi-region failover |
| Database Load | Low, single instance | Moderate, read replicas | High, sharded database | Very high, multi-shard, geo-distributed DB |
| Alerting Latency | Seconds | Seconds to sub-second | Sub-second | Milliseconds |
| Storage Needs (raw, ~1 KB/event, per year) | ~5 GB | ~0.5 TB | ~50 TB | ~5 PB |
| Network Bandwidth | Low | Moderate | High | Very High |
Emergency handling in LLD - Scalability & System Analysis
The database is usually the first bottleneck as event volume grows, because emergency handling systems need fast writes and reads for alerts and logs. At around 10,000 users (~1,000 events/min) a single instance still copes comfortably; as volume approaches 1 million users (~100,000 events/min), a single database instance starts to struggle with write throughput and query latency, especially during traffic spikes.
- Horizontal Scaling: Add more application servers behind load balancers to handle increased event processing.
- Database Read Replicas: Use replicas to offload read queries and reduce latency.
- Sharding: Partition the database by event type or region to distribute load.
- Caching: Cache frequent queries and alert statuses in fast in-memory stores like Redis.
- Message Queues: Use queues to buffer incoming events and smooth spikes in traffic.
- CDN and Edge Computing: For alert delivery (e.g., notifications), use CDNs and edge nodes to reduce latency globally.
- Multi-region Deployment: Deploy system components in multiple regions for fault tolerance and disaster recovery.
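A minimal sketch of how queue buffering and sharding can fit together, assuming each event carries a `region` field and the database is sharded by region. A bounded `queue.Queue` stands in for a durable message broker and absorbs bursts; a consumer drains it in batches and groups events by a shard chosen from a stable hash of the region. `EventBuffer`, `shard_for`, `group_by_shard`, the shard count, and the batch size are all hypothetical.

```python
import hashlib
import queue
from collections import defaultdict

NUM_SHARDS = 4      # hypothetical shard count
MAX_BATCH = 100     # hypothetical bulk-write size


def shard_for(region: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a region to a shard with a stable hash.

    Python's built-in hash() is randomized per process for strings, so it
    would route the same region differently in different consumers.
    """
    digest = hashlib.sha256(region.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class EventBuffer:
    """Bounded in-memory queue standing in for a message broker."""

    def __init__(self, maxsize: int = 10_000) -> None:
        self._q: queue.Queue = queue.Queue(maxsize=maxsize)

    def publish(self, event: dict) -> bool:
        """Enqueue without blocking; False means the buffer is full."""
        try:
            self._q.put_nowait(event)
            return True
        except queue.Full:
            return False

    def drain_batch(self, max_batch: int = MAX_BATCH) -> list:
        """Take up to max_batch queued events without blocking."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch


def group_by_shard(batch: list) -> dict:
    """Group events so each shard gets one bulk write instead of many single writes."""
    groups = defaultdict(list)
    for event in batch:
        groups[shard_for(event["region"])].append(event)
    return dict(groups)


buf = EventBuffer(maxsize=1_000)
for i in range(250):                      # a burst of 250 events
    buf.publish({"id": i, "region": ("us", "eu", "ap")[i % 3]})
batch = buf.drain_batch()
print(len(batch))                         # 100; the other 150 stay queued
for shard, events in sorted(group_by_shard(batch).items()):
    print(f"shard {shard}: {len(events)} events")
```

Returning `False` from `publish` when the buffer is full, instead of blocking, leaves the ingestion path free to retry or shed load; batching turns many single-row writes into one bulk write per shard.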
- At 10,000 users generating ~1,000 events/min (~17 events/sec), the system needs to handle ~17 writes/sec plus reads.
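- Database write capacity: A single well-tuned PostgreSQL instance can typically sustain thousands of simple writes per second (these estimates assume ~5,000 QPS; the real figure depends heavily on hardware and schema), so write load is easily manageable at this scale.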
- Storage: Assuming 1 KB per event, 1,000 events/min ≈ 1 MB/min ≈ 1.4 GB/day ≈ 43 GB/month.
- Network bandwidth: 1,000 events/min * 1 KB = ~17 KB/sec, very low at this scale.
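- At 1 million users (~100,000 events/min), write load averages ~1,667 writes/sec. With reads and traffic spikes on top, this approaches the ~5,000 QPS single-instance budget assumed above, so a sharded database and caching become necessary.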
- Bandwidth and storage scale by the same factor (~1.7 MB/sec of ingress and ~144 GB/day of raw event data), requiring distributed storage and efficient data retention policies.
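The estimates above can be reproduced with a small helper. This is a back-of-envelope sketch under the same assumptions the list uses (about 0.1 events per user per minute, so 10,000 users produce ~1,000 events/min, and ~1 KB per event in decimal units); the `estimate` function, the `Capacity` fields, and the 30-day month are illustrative choices.

```python
from dataclasses import dataclass

EVENTS_PER_USER_PER_MIN = 0.1   # assumption: 10,000 users -> ~1,000 events/min
BYTES_PER_EVENT = 1_000         # assumption: ~1 KB per event (decimal units)


@dataclass
class Capacity:
    events_per_min: float
    writes_per_sec: float
    ingress_kb_per_sec: float
    storage_gb_per_day: float
    storage_gb_per_month: float   # 30-day month


def estimate(users: int) -> Capacity:
    """Back-of-envelope load estimate: every event is one write of BYTES_PER_EVENT."""
    events_per_min = users * EVENTS_PER_USER_PER_MIN
    bytes_per_min = events_per_min * BYTES_PER_EVENT
    bytes_per_day = bytes_per_min * 60 * 24
    return Capacity(
        events_per_min=events_per_min,
        writes_per_sec=events_per_min / 60,
        ingress_kb_per_sec=bytes_per_min / 60 / 1_000,
        storage_gb_per_day=bytes_per_day / 1e9,
        storage_gb_per_month=bytes_per_day * 30 / 1e9,
    )


for users in (10_000, 1_000_000):
    c = estimate(users)
    print(f"{users:>9,} users: {c.writes_per_sec:,.0f} writes/s, "
          f"{c.ingress_kb_per_sec:,.0f} KB/s, {c.storage_gb_per_month:,.1f} GB/month")
```

For 10,000 users it reports ~17 writes/sec, ~17 KB/s and ~43 GB/month; for 1 million users, ~1,667 writes/sec, ~1,667 KB/s and ~4,320 GB (about 4.3 TB) per month.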
Start by clarifying the expected event volume and latency requirements. Discuss the data flow from event ingestion to alerting. Identify the database as the likely bottleneck early. Propose incremental scaling steps: caching, read replicas, sharding, and multi-region deployment. Emphasize fault tolerance and disaster recovery in emergency systems.
Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?
Answer: Add read replicas to offload read queries and implement caching to reduce database load. If writes are the bottleneck, consider sharding the database to distribute write load across multiple instances.
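A minimal sketch of the first step in that answer, assuming alert status is read far more often than it changes: reads go through an in-process TTL cache (standing in for Redis) and fall back to round-robin read replicas, while writes go to the primary and invalidate the cached entry. `AlertStatusStore`, the dict-backed "databases", the alert ID, and the TTL are hypothetical, and replication is simulated as instantaneous.

```python
import itertools
import time
from typing import Optional


class AlertStatusStore:
    """Cache-aside reads over round-robin read replicas; writes go to the primary."""

    def __init__(self, num_replicas: int = 2, ttl_seconds: float = 5.0) -> None:
        self.primary: dict = {}
        self.replicas: list = [{} for _ in range(num_replicas)]
        self._next_replica = itertools.cycle(range(num_replicas))
        self._cache: dict = {}              # alert_id -> (status, expires_at)
        self.ttl = ttl_seconds
        self.replica_reads = 0              # counts reads that missed the cache

    def set_status(self, alert_id: str, status: str) -> None:
        self.primary[alert_id] = status
        # Simulated replication; real replicas apply changes asynchronously.
        for replica in self.replicas:
            replica[alert_id] = status
        self._cache.pop(alert_id, None)     # invalidate so the next read refetches

    def get_status(self, alert_id: str) -> Optional[str]:
        now = time.monotonic()
        hit = self._cache.get(alert_id)
        if hit is not None and hit[1] > now:
            return hit[0]                   # cache hit: no database read
        replica = self.replicas[next(self._next_replica)]
        self.replica_reads += 1
        status = replica.get(alert_id)
        if status is not None:
            self._cache[alert_id] = (status, now + self.ttl)
        return status


store = AlertStatusStore()
store.set_status("A-17", "FIRING")
print(store.get_status("A-17"), store.get_status("A-17"), store.replica_reads)  # FIRING FIRING 1
```

Invalidating on write keeps the sketch simple. With real asynchronous replication, a read right after a write can still hit a lagging replica and cache a stale status, so reads that must be fresh are often sent to the primary.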
Practice
Solution
Step 1: Understand the purpose of emergency handling
Emergency handling systems are designed to detect issues quickly and act to prevent harm.
Step 2: Identify the main goal
The main goal is to protect people and property through quick detection and response.
Final Answer:
To detect problems quickly and protect people and property -> Option A
Quick Check:
Emergency handling = fast detection and protection [OK]
Common Mistakes:
- Confusing emergency handling with performance optimization
- Thinking it is about cost reduction
- Assuming it is for marketing analytics
Solution
Step 1: List typical components
Emergency handling systems usually include detection, alerting, safety-action, and logging components.
Step 2: Identify the unrelated component
A user interface for marketing is unrelated to emergency handling functions.
Final Answer:
User interface for marketing -> Option D
Quick Check:
Marketing UI ≠ emergency handling component [OK]
Common Mistakes:
- Including unrelated business components
- Confusing alerting with marketing notifications
- Ignoring safety action controllers
```python
if sensor.detect(): alert.send(); safety.activate(); log.record()
```
What happens if sensor.detect() returns false?
Solution
Step 1: Analyze the if condition
The actions alert.send(), safety.activate(), and log.record() all belong to the body of the if (the semicolons keep them on the same line as the condition), so they run only if sensor.detect() is true.
Step 2: Determine behavior when sensor.detect() is false
If sensor.detect() returns false, the body of the if does not run, so no actions execute.
Final Answer:
No actions execute -> Option C
Quick Check:
False detection = no emergency actions [OK]
Common Mistakes:
- Assuming log.record() always runs regardless of detection
- Thinking alert or safety actions run without detection
- Expecting else-branch behavior when no else is given
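The behavior described in this solution can be checked directly. In this sketch `sensor`, `alert`, `safety`, and `log` are hypothetical stand-ins built with `types.SimpleNamespace` whose methods record each call; the snippet from the question is used unchanged inside `run_snippet`.

```python
from types import SimpleNamespace


def make_stubs(detected: bool):
    """Build stand-in objects whose methods append their name to a shared call list."""
    calls = []
    sensor = SimpleNamespace(detect=lambda: detected)
    alert = SimpleNamespace(send=lambda: calls.append("alert.send"))
    safety = SimpleNamespace(activate=lambda: calls.append("safety.activate"))
    log = SimpleNamespace(record=lambda: calls.append("log.record"))
    return sensor, alert, safety, log, calls


def run_snippet(detected: bool) -> list:
    sensor, alert, safety, log, calls = make_stubs(detected)
    # The snippet from the question: all three calls belong to the if-body.
    if sensor.detect(): alert.send(); safety.activate(); log.record()
    return calls


print(run_snippet(False))   # []
print(run_snippet(True))    # ['alert.send', 'safety.activate', 'log.record']
```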
```python
if sensor.detect():
    alert.send()
    safety.activate()
log.record()
```
What is the main issue?
Solution
Step 1: Check code indentation
log.record() is not indented under the if, so it always runs.
Step 2: Understand the impact
log.record() runs even when sensor.detect() is false, which is incorrect behavior.
Final Answer:
Missing indentation causes log.record() to always run -> Option A
Quick Check:
Indentation controls conditional execution [OK]
Common Mistakes:
- Ignoring the importance of indentation
- Assuming all lines are inside the if by default
- Confusing which lines run conditionally
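A corrected version of the buggy snippet indents log.record() so it runs only inside the if. The sketch below wraps it in a function `handle` that returns whether an emergency was handled; `Recorder` is a hypothetical stand-in for the sensor, alert, safety, and log objects that records every method call so the fix can be checked.

```python
class Recorder:
    """Hypothetical stand-in for sensor/alert/safety/log that records method calls."""

    def __init__(self, name: str, calls: list, detected: bool = False) -> None:
        self.name, self.calls, self.detected = name, calls, detected

    def detect(self) -> bool:
        return self.detected

    def __getattr__(self, method: str):
        # Any other method (send, activate, record) just records "<name>.<method>".
        return lambda: self.calls.append(f"{self.name}.{method}")


def handle(sensor, alert, safety, log) -> bool:
    if sensor.detect():
        alert.send()
        safety.activate()
        log.record()        # fixed: indented, so it only runs on a detection
        return True
    return False


calls = []
objs = [Recorder(n, calls) for n in ("sensor", "alert", "safety", "log")]
print(handle(*objs), calls)      # False []
objs[0].detected = True
print(handle(*objs), calls)      # True ['alert.send', 'safety.activate', 'log.record']
```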
Solution
Step 1: Understand reliability needs
To ensure alerts reach multiple teams, sending them in parallel prevents one failing channel from blocking the others.
Step 2: Use retries and fallback logging
Retries recover from temporary failures; fallback logging records persistent failures for later review.
Final Answer:
Send alerts in parallel with retries and fallback logging -> Option B
Quick Check:
Parallel + retries = reliable alerting [OK]
Common Mistakes:
- Stopping all alerts at the first failure
- Ignoring retries and fallback mechanisms
- Reducing alert recipients to simplify
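A sketch of the chosen approach using only the Python standard library: each team is notified in its own worker thread via `concurrent.futures.ThreadPoolExecutor`, each send is retried with exponential backoff, and a team that still fails is written to a fallback log instead of stopping the others. `send_to_team`, the team names, and the retry parameters are hypothetical; the simulated outage makes the demo deterministic.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

logging.basicConfig(level=logging.INFO)
fallback_log = logging.getLogger("alert-fallback")


def send_with_retry(send: Callable[[str, str], None], team: str, message: str,
                    attempts: int = 3, base_delay: float = 0.05) -> bool:
    """Try send(team, message) up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        try:
            send(team, message)
            return True
        except Exception as exc:  # a real system would catch transport-specific errors
            if attempt == attempts - 1:
                fallback_log.error("alert to %s failed after %d attempts: %s",
                                   team, attempts, exc)
                return False
            time.sleep(base_delay * 2 ** attempt)
    return False


def broadcast(send: Callable[[str, str], None], teams: list, message: str) -> dict:
    """Notify every team in parallel; one team's failure does not block the others."""
    with ThreadPoolExecutor(max_workers=max(1, len(teams))) as pool:
        futures = {team: pool.submit(send_with_retry, send, team, message) for team in teams}
        return {team: future.result() for team, future in futures.items()}


def send_to_team(team: str, message: str) -> None:
    """Hypothetical transport: the 'security' channel is down in this demo."""
    if team == "security":
        raise ConnectionError("channel unavailable")
    print(f"sent to {team}: {message}")


print(broadcast(send_to_team, ["ops", "security", "medical"], "Fire alarm, building 3"))
# {'ops': True, 'security': False, 'medical': True}
```

Returning a per-team result lets the caller escalate only the teams that were never reached, while the fallback log keeps a record for later review.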
