Bird
Raised Fist0
HLDsystem_design~10 mins

Social graph storage in HLD - Scalability & System Analysis

Choose your learning style9 modes available
Scalability Analysis - Social graph storage
Growth Table: Social Graph Storage
ScaleUsersConnections (Edges)Storage NeedsQuery LoadSystem Changes
Small100 users~10K edgesFew MBsLow QPS (10s)Single DB instance, simple graph model
Medium10K users~1M edgesGBsHundreds QPSIndexing, caching, read replicas
Large1M users~100M edges100s GBs to TBsThousands QPSSharding, graph DB or specialized storage, distributed cache
Very Large100M users~10B edgesMultiple TBs to PBsHundreds of thousands QPSMulti-region clusters, advanced partitioning, CDN for metadata, asynchronous processing
First Bottleneck

At small scale, the database handles all queries easily. As users grow to millions, the database storage and query performance become the first bottleneck. This is because social graphs have many connections per user, causing large, complex queries that slow down the DB. Also, single DB instances cannot handle the volume of reads/writes.

Scaling Solutions
  • Horizontal scaling: Add more database servers and shard data by user ID or graph partition to distribute load.
  • Caching: Use in-memory caches (e.g., Redis) for frequent queries like friend lists to reduce DB hits.
  • Graph databases: Use specialized graph DBs (e.g., Neo4j, JanusGraph) optimized for relationship queries.
  • Read replicas: Separate read and write traffic to improve throughput.
  • Asynchronous processing: For heavy computations (e.g., recommendations), use background jobs to avoid blocking user queries.
  • CDN and edge caching: Cache user profile metadata near users to reduce latency.
Back-of-Envelope Cost Analysis

Assuming 1M users with average 100 connections each:

  • Edges: 100M connections
  • Storage: If each edge record is ~100 bytes, total ~10GB just for edges (excluding indexes and metadata)
  • Requests per second (QPS): For 1M users, assume 0.01 QPS per user -> 10K QPS total
  • Bandwidth: If each query returns ~1KB, 10K QPS -> ~10MB/s bandwidth
  • Server capacity: One DB instance handles ~5K QPS, so at 10K QPS need at least 2 DB servers or read replicas
Interview Tip

Start by defining the scale and data model. Then identify the bottleneck (usually DB). Discuss how to partition data and use caching. Mention trade-offs between consistency and availability. Finally, explain how to handle read/write loads separately and use asynchronous processing for heavy tasks.

Self Check Question

Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?

Answer: Add read replicas to distribute read traffic and reduce load on the primary DB. Also, implement caching for frequent queries to reduce DB hits. If writes grow, consider sharding data to multiple DB instances.

Key Result
Social graph storage scales from simple single DB setups at small user counts to complex sharded, cached, and distributed graph databases at large scale, with the database becoming the first bottleneck as user connections grow exponentially.