Design a Key-Value Store (HLD): Scalability & System Analysis

| Metric | 100 Users | 10K Users | 1M Users | 100M Users |
|---|---|---|---|---|
| Requests per second (QPS) | ~100 | ~10,000 | ~1,000,000 | ~100,000,000 |
| Data size | ~10 GB | ~1 TB | ~100 TB | ~10 PB |
| Number of servers | 1-2 | 10-20 | 100-200 | 10,000+ |
| Latency | <1 ms | <5 ms | <10 ms | <20 ms |
| Storage type | Local SSD | Distributed SSD | Sharded distributed storage | Multi-region distributed storage |
| Replication | Simple master-slave | Multi-replica for availability | Geo-replication | Global replication with consistency |
At small scale (100 users), the database server CPU and disk I/O are the first bottlenecks because a single server handles all requests and data.
At medium scale (10K to 1M users), the database query throughput and network bandwidth become bottlenecks as requests increase beyond a single server's capacity.
At large scale (100M users), data partitioning and cross-region replication latency become bottlenecks due to massive data size and global distribution.
- Vertical scaling: Upgrade server CPU, RAM, and SSDs for small scale.
- Horizontal scaling: Add more servers behind a load balancer to distribute requests.
- Sharding: Split data by key ranges or hash to distribute storage and load across servers.
- Caching: Use in-memory caches (e.g., Redis) to reduce database load for frequent reads.
- Replication: Use master-slave or multi-master replication for availability and read scaling.
- Consistent hashing: Map keys onto a hash ring so that adding or removing servers moves only a small fraction of keys.
- Geo-distribution: Deploy data centers closer to users to reduce latency at large scale.
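The sharding and consistent-hashing points above can be sketched together. Below is a minimal, illustrative hash ring with virtual nodes; the `HashRing` class and its method names are assumptions for this sketch, not any specific library's API:

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual nodes per server smooth out key distribution
        self.ring = []        # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        # Place `vnodes` points for this server on the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect_left(self.ring, (self._hash(key), ""))
        if idx == len(self.ring):
            idx = 0  # wrap around the ring
        return self.ring[idx][1]


ring = HashRing(["server-a", "server-b", "server-c"])
owner = ring.get("user:42")  # always routes this key to the same server
```

Because each server owns many small arcs of the ring, adding a fourth server takes roughly a quarter of the keys from the existing three, rather than reshuffling everything as naive `hash(key) % N` would.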
- At 1M QPS, assuming 1KB per request, bandwidth needed is ~1 GB/s (8 Gbps).
- Storage for 100 TB data requires multiple SSD servers; each SSD ~4 TB, so ~25 servers minimum.
- Each server handles ~5,000 QPS; for 1M QPS, need ~200 servers.
- Network infrastructure must support high throughput and low latency.
- Replication multiplies storage and bandwidth needs: a replication factor of 2 doubles them, a factor of 3 triples them.
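These estimates are easy to sanity-check in code. A quick back-of-envelope script using the section's own assumptions (1 KB per request, ~4 TB per SSD server, ~5,000 QPS per server, replication factor 2):

```python
# Back-of-envelope capacity check using the assumptions stated above.
qps = 1_000_000          # target requests per second
request_size = 1_000     # bytes per request (~1 KB)
data_size_tb = 100       # total data set, in TB
ssd_tb = 4               # usable capacity per SSD server, in TB
qps_per_server = 5_000   # sustainable QPS per server
replication = 2          # copies of each item

bandwidth_gbps = qps * request_size * 8 / 1e9       # bytes/s -> bits/s -> Gbps
storage_servers = data_size_tb * replication // ssd_tb
compute_servers = qps // qps_per_server

print(f"bandwidth: {bandwidth_gbps:.0f} Gbps")      # 8 Gbps
print(f"storage servers: {storage_servers}")        # 50 (25 without replication)
print(f"compute servers: {compute_servers}")        # 200
```

Note that the binding constraint differs: storage needs ~25-50 servers while QPS needs ~200, so at this scale the fleet is sized by request throughput, not by disk.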
Start by clarifying requirements: data size, read/write ratio, latency needs.
Discuss simple design first, then identify bottlenecks as scale grows.
Explain how each scaling solution addresses specific bottlenecks.
Use real numbers to justify design choices and trade-offs.
Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?
Answer: Add read replicas and implement caching to reduce load on the primary database before scaling vertically or sharding.
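The "caching first" part of this answer is the cache-aside pattern. A minimal sketch, with a plain dict standing in for Redis and another for the primary database (the `CacheAsideStore` class and its counters are hypothetical, for illustration only):

```python
class CacheAsideStore:
    """Cache-aside reads: check the cache, fall back to the DB, then populate the cache."""

    def __init__(self, db: dict):
        self.db = db        # stand-in for the primary database
        self.cache = {}     # stand-in for an in-memory cache like Redis
        self.db_reads = 0   # counts how many reads actually hit the database

    def get(self, key):
        if key in self.cache:            # cache hit: no database load
            return self.cache[key]
        self.db_reads += 1               # cache miss: read from the database
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value      # populate cache for subsequent reads
        return value

    def put(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)        # invalidate; the next read refills the cache


store = CacheAsideStore({"user:1": "alice"})
store.get("user:1")   # miss: reads the database, fills the cache
store.get("user:1")   # hit: served from cache, database untouched
```

With a typical read-heavy workload, repeated reads are absorbed by the cache, which is why caching plus read replicas is usually the first lever before resorting to sharding.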
