| Scale | Users | Connections (Edges) | Storage Needs | Query Load | System Changes |
|---|---|---|---|---|---|
| Small | 100 users | ~10K edges | Few MBs | Low QPS (tens) | Single DB instance, simple graph model |
| Medium | 10K users | ~1M edges | GBs | Hundreds of QPS | Indexing, caching, read replicas |
| Large | 1M users | ~100M edges | 100s of GBs to TBs | Thousands of QPS | Sharding, graph DB or specialized storage, distributed cache |
| Very Large | 100M users | ~10B edges | Multiple TBs to PBs | Hundreds of thousands of QPS | Multi-region clusters, advanced partitioning, CDN for metadata, asynchronous processing |
## Social graph storage in HLD - Scalability & System Analysis
At small scale, a single database handles all queries easily. As the user base grows into the millions, database storage and query performance become the first bottleneck: social graphs have many connections per user, so relationship queries touch many rows and become large and slow, and a single DB instance cannot absorb the combined read/write volume.
- Horizontal scaling: Add more database servers and shard data by user ID or graph partition to distribute load.
- Caching: Use in-memory caches (e.g., Redis) for frequent queries like friend lists to reduce DB hits.
- Graph databases: Use specialized graph DBs (e.g., Neo4j, JanusGraph) optimized for relationship queries.
- Read replicas: Separate read and write traffic to improve throughput.
- Asynchronous processing: For heavy computations (e.g., recommendations), use background jobs to avoid blocking user queries.
- CDN and edge caching: Cache user profile metadata near users to reduce latency.
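Two of the techniques above, hash-based sharding by user ID and cache-aside lookups for friend lists, can be sketched together in a few lines. This is a minimal illustration, not a production design: `NUM_SHARDS`, `FriendService`, and the dict-backed cache and shard stores are all hypothetical stand-ins (in practice the cache would be something like Redis and the shards real DB connections).

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(user_id: str) -> int:
    """Map a user ID to a shard via a stable hash (hash-based sharding)."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


class FriendService:
    """Cache-aside lookup: try the cache first, fall back to the user's shard."""

    def __init__(self, cache, shard_dbs):
        self.cache = cache          # dict-like stand-in for an in-memory cache
        self.shard_dbs = shard_dbs  # one dict-like store per shard

    def get_friends(self, user_id: str):
        cached = self.cache.get(user_id)
        if cached is not None:
            return cached  # cache hit: no DB round trip
        # Cache miss: read from the owning shard, then populate the cache.
        friends = self.shard_dbs[shard_for(user_id)].get(user_id, [])
        self.cache[user_id] = friends
        return friends
```

The key property is that `shard_for` is deterministic, so every service instance routes a given user to the same shard without coordination.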
Assuming 1M users with an average of 100 connections each:
- Edges: 100M connections
- Storage: If each edge record is ~100 bytes, total ~10GB just for edges (excluding indexes and metadata)
- Requests per second (QPS): For 1M users, assume 0.01 QPS per user -> 10K QPS total
- Bandwidth: If each query returns ~1KB, 10K QPS -> ~10MB/s bandwidth
- Server capacity: One DB instance handles ~5K QPS, so at 10K QPS need at least 2 DB servers or read replicas
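The arithmetic behind these estimates is easy to script. The inputs below (edge record size, per-user QPS, response size, per-instance capacity) are the same assumptions stated above:

```python
# Back-of-the-envelope numbers from the estimates above.
users = 1_000_000
avg_connections = 100
edge_bytes = 100          # assumed size of one edge record
qps_per_user = 0.01       # assumed average request rate per user
resp_kb = 1               # assumed response size per query
db_capacity_qps = 5_000   # assumed capacity of one DB instance

edges = users * avg_connections                  # 100M edges
storage_gb = edges * edge_bytes / 1e9            # ~10 GB for edges alone
total_qps = users * qps_per_user                 # 10K QPS
bandwidth_mb_s = total_qps * resp_kb / 1000      # ~10 MB/s
db_instances = -(-total_qps // db_capacity_qps)  # ceiling division -> 2

print(edges, storage_gb, total_qps, bandwidth_mb_s, db_instances)
# → 100000000 10.0 10000.0 10.0 2.0
```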
Start by defining the scale and data model. Then identify the likely bottleneck (usually the database). Discuss how to partition the data and where to add caching, and note the trade-offs between consistency and availability. Finally, explain how to handle read and write loads separately and push heavy tasks into asynchronous processing.
Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?
Answer: Add read replicas to distribute read traffic and reduce load on the primary DB. Also, implement caching for frequent queries to reduce DB hits. If writes grow, consider sharding data to multiple DB instances.
