| Scale | Users | Connections (Edges) | Storage Needs | Query Load | System Changes |
|---|---|---|---|---|---|
| Small | 100 users | ~10K edges | Few MBs | Low QPS (tens) | Single DB instance, simple graph model |
| Medium | 10K users | ~1M edges | GBs | Hundreds of QPS | Indexing, caching, read replicas |
| Large | 1M users | ~100M edges | 100s of GBs to TBs | Thousands of QPS | Sharding, graph DB or specialized storage, distributed cache |
| Very Large | 100M users | ~10B edges | Multiple TBs to PBs | Hundreds of thousands of QPS | Multi-region clusters, advanced partitioning, CDN for metadata, asynchronous processing |
## Social graph storage in HLD - Scalability & System Analysis
At small scale, a single database handles all queries easily. As the user base grows into the millions, database storage and query performance become the first bottleneck: social graphs have many connections per user, so relationship queries touch many rows and become large and slow, and a single DB instance cannot absorb the combined read/write volume.
- Horizontal scaling: Add more database servers and shard data by user ID or graph partition to distribute load.
- Caching: Use in-memory caches (e.g., Redis) for frequent queries like friend lists to reduce DB hits.
- Graph databases: Use specialized graph DBs (e.g., Neo4j, JanusGraph) optimized for relationship queries.
- Read replicas: Separate read and write traffic to improve throughput.
- Asynchronous processing: For heavy computations (e.g., recommendations), use background jobs to avoid blocking user queries.
- CDN and edge caching: Cache user profile metadata near users to reduce latency.
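Two of the techniques above, hash-based sharding by user ID and cache-aside lookups for friend lists, can be sketched together in a few lines. This is a minimal illustration, not a production design: `NUM_SHARDS`, `FriendService`, and the dict-backed cache and shard stores are all hypothetical stand-ins (in practice the cache would be something like Redis and the shards real DB connections).

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(user_id: str) -> int:
    """Map a user ID to a shard via a stable hash (hash-based sharding)."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


class FriendService:
    """Cache-aside lookup: try the cache first, fall back to the user's shard."""

    def __init__(self, cache, shard_dbs):
        self.cache = cache          # dict-like stand-in for an in-memory cache
        self.shard_dbs = shard_dbs  # one dict-like store per shard

    def get_friends(self, user_id: str):
        cached = self.cache.get(user_id)
        if cached is not None:
            return cached  # cache hit: no DB round trip
        # Cache miss: read from the owning shard, then populate the cache.
        friends = self.shard_dbs[shard_for(user_id)].get(user_id, [])
        self.cache[user_id] = friends
        return friends
```

The key property is that `shard_for` is deterministic, so every service instance routes a given user to the same shard without coordination.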
Assuming 1M users with an average of 100 connections each:
- Edges: 100M connections
- Storage: If each edge record is ~100 bytes, total ~10GB just for edges (excluding indexes and metadata)
- Requests per second (QPS): For 1M users, assume 0.01 QPS per user -> 10K QPS total
- Bandwidth: If each query returns ~1KB, 10K QPS -> ~10MB/s bandwidth
- Server capacity: One DB instance handles ~5K QPS, so at 10K QPS need at least 2 DB servers or read replicas
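The arithmetic behind these estimates is easy to script. The inputs below (edge record size, per-user QPS, response size, per-instance capacity) are the same assumptions stated above:

```python
# Back-of-the-envelope numbers from the estimates above.
users = 1_000_000
avg_connections = 100
edge_bytes = 100          # assumed size of one edge record
qps_per_user = 0.01       # assumed average request rate per user
resp_kb = 1               # assumed response size per query
db_capacity_qps = 5_000   # assumed capacity of one DB instance

edges = users * avg_connections                  # 100M edges
storage_gb = edges * edge_bytes / 1e9            # ~10 GB for edges alone
total_qps = users * qps_per_user                 # 10K QPS
bandwidth_mb_s = total_qps * resp_kb / 1000      # ~10 MB/s
db_instances = -(-total_qps // db_capacity_qps)  # ceiling division -> 2

print(edges, storage_gb, total_qps, bandwidth_mb_s, db_instances)
# → 100000000 10.0 10000.0 10.0 2.0
```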
Start by defining the scale and data model. Then identify the likely bottleneck (usually the database). Discuss how to partition the data and where to add caching, and note the trade-offs between consistency and availability. Finally, explain how to handle read and write loads separately and push heavy tasks into asynchronous processing.
Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?
Answer: Add read replicas to distribute read traffic and reduce load on the primary DB. Also, implement caching for frequent queries to reduce DB hits. If writes grow, consider sharding data to multiple DB instances.
