
Design a Key-Value Store (High-Level Design) - Scalability & System Analysis

Scalability Analysis - Design a key-value store
Growth Table: Key-Value Store Scaling
| Metric                    | 100 Users           | 10K Users                      | 1M Users                    | 100M Users                          |
|---------------------------|---------------------|--------------------------------|-----------------------------|-------------------------------------|
| Requests per second (QPS) | ~100                | ~10,000                        | ~1,000,000                  | ~100,000,000                        |
| Data size                 | ~10 GB              | ~1 TB                          | ~100 TB                     | ~10 PB                              |
| Number of servers         | 1-2                 | 10-20                          | 100-200                     | 10,000+                             |
| Latency                   | <1 ms               | <5 ms                          | <10 ms                      | <20 ms                              |
| Storage type              | Local SSD           | Distributed SSD                | Sharded distributed storage | Multi-region distributed storage    |
| Replication               | Simple master-slave | Multi-replica for availability | Geo-replication             | Global replication with consistency |
First Bottleneck

At small scale (100 users), the database server CPU and disk I/O are the first bottlenecks because a single server handles all requests and data.

At medium scale (10K to 1M users), the database query throughput and network bandwidth become bottlenecks as requests increase beyond a single server's capacity.

At large scale (100M users), data partitioning and cross-region replication latency become bottlenecks due to massive data size and global distribution.

Scaling Solutions
  • Vertical scaling: Upgrade server CPU, RAM, and SSDs for small scale.
  • Horizontal scaling: Add more servers behind a load balancer to distribute requests.
  • Sharding: Split data by key ranges or hash to distribute storage and load across servers.
  • Caching: Use in-memory caches (e.g., Redis) to reduce database load for frequent reads.
  • Replication: Use master-slave or multi-master replication for availability and read scaling.
  • Consistent hashing: Minimize data movement when scaling out or in.
  • Geo-distribution: Deploy data centers closer to users to reduce latency at large scale.
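The consistent hashing bullet above can be sketched in a few lines of Python. This is a minimal illustration, not a specific library's API: the class name, the MD5-based hash, and the vnode count of 100 are all illustrative choices.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._keys = []   # sorted vnode hashes, for bisect lookups
        self._ring = []   # parallel list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # Any uniform hash works; MD5 is used here only for illustration.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each physical node gets `vnodes` positions on the ring,
        # which spreads its share of keys evenly.
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            idx = bisect.bisect(self._keys, h)
            self._keys.insert(idx, h)
            self._ring.insert(idx, (h, node))

    def remove_node(self, node):
        # Only keys that lived on this node's vnodes move elsewhere;
        # all other keys keep their current placement.
        pairs = [(h, n) for h, n in self._ring if n != node]
        self._ring = pairs
        self._keys = [h for h, _ in pairs]

    def get_node(self, key):
        # A key maps to the first vnode clockwise from its hash.
        if not self._keys:
            raise KeyError("ring is empty")
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]
```

The payoff is visible when a node is removed: only the keys that were on that node get remapped, instead of nearly all keys as with naive `hash(key) % N` sharding.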
Back-of-Envelope Cost Analysis
  • At 1M QPS, assuming 1KB per request, bandwidth needed is ~1 GB/s (8 Gbps).
  • Storage for 100 TB data requires multiple SSD servers; each SSD ~4 TB, so ~25 servers minimum.
  • Each server handles ~5,000 QPS; for 1M QPS, need ~200 servers.
  • Network infrastructure must support high throughput and low latency.
  • Replication doubles storage and bandwidth needs.
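The arithmetic behind these bullets fits in a short Python snippet, using the same assumed inputs (1 KB requests, 4 TB SSDs, 5,000 QPS per server, one extra replica):

```python
# Back-of-envelope capacity estimate for the key-value store.
qps = 1_000_000           # peak requests per second
request_size_kb = 1       # assumed ~1 KB per request

# 1M req/s * 1 KB = ~1 GB/s, i.e. ~8 Gbps (decimal units).
bandwidth_gbps = qps * request_size_kb * 8 / 1_000_000

data_tb = 100             # total dataset size
ssd_tb = 4                # assumed capacity per SSD server
storage_servers = data_tb // ssd_tb      # ~25 servers for raw data

qps_per_server = 5_000    # assumed per-server throughput
compute_servers = qps // qps_per_server  # ~200 servers for throughput

replication_factor = 2    # one extra copy doubles storage and bandwidth
total_storage_tb = data_tb * replication_factor
```

Running this reproduces the numbers above: 8 Gbps of bandwidth, ~25 storage servers, ~200 compute servers, and 200 TB of replicated storage.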
Interview Tip

  • Start by clarifying requirements: data size, read/write ratio, latency needs.
  • Discuss the simple design first, then identify bottlenecks as scale grows.
  • Explain how each scaling solution addresses specific bottlenecks.
  • Use real numbers to justify design choices and trade-offs.

Self Check Question

Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?

Answer: Add read replicas and implement caching to reduce load on the primary database before scaling vertically or sharding.
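The caching half of that answer is usually the cache-aside pattern, sketched below. The `cache` and `db` dicts are stand-ins; in practice they might be Redis and a primary/replica database.

```python
# Cache-aside read path: check the cache first, fall back to the
# database on a miss, then populate the cache for later reads.
cache = {}                            # stand-in for an in-memory cache
db = {"user:42": {"name": "Ada"}}     # stand-in for the database

def get(key):
    if key in cache:                  # cache hit: no database load
        return cache[key]
    value = db.get(key)               # cache miss: read from the database
    if value is not None:
        cache[key] = value            # subsequent reads skip the database
    return value
```

At 10,000 QPS with a read-heavy workload, even a modest cache hit rate cuts the primary's load enough to buy time before sharding.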

Key Result
A key-value store scales by starting with vertical scaling and simple replication, then moves to horizontal scaling with sharding and caching as traffic and data grow, addressing bottlenecks in database throughput, storage, and network bandwidth.