
Social graph storage in HLD - System Design Exercise

Design: Social Graph Storage System
The design focuses on storing and querying social graph connections. User authentication, UI, and recommendation algorithms are out of scope.
Functional Requirements
FR1: Store user profiles and their connections (friends, followers).
FR2: Support adding and removing connections between users.
FR3: Efficiently query direct connections of a user.
FR4: Support querying mutual connections between two users.
FR5: Handle up to 100 million users and 10 billion connections.
FR6: Provide low latency queries (p99 < 100ms) for connection lookups.
FR7: Ensure data consistency for connection updates.
FR8: Support eventual consistency for large-scale replication.
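FR3 and FR4 can be made concrete with a minimal in-memory sketch: connections as adjacency sets, with mutual connections computed by set intersection. The class and method names here are illustrative, not part of any specified API.

```python
class SocialGraph:
    """Toy adjacency-set model of the social graph (illustration only)."""

    def __init__(self):
        self.adjacency = {}  # user_id -> set of connected user_ids

    def add_connection(self, a, b):
        # Friendship stored as two directed edges (undirected semantics).
        self.adjacency.setdefault(a, set()).add(b)
        self.adjacency.setdefault(b, set()).add(a)

    def remove_connection(self, a, b):
        self.adjacency.get(a, set()).discard(b)
        self.adjacency.get(b, set()).discard(a)

    def connections(self, user):
        # FR3: direct connections of a user.
        return self.adjacency.get(user, set())

    def mutual_connections(self, a, b):
        # FR4: set intersection, O(min(|A|, |B|)) per lookup.
        return self.connections(a) & self.connections(b)
```

At scale the same intersection pattern applies, but each adjacency set lives in the graph storage cluster rather than a Python dict.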
Non-Functional Requirements
NFR1: Scale to 100 million users and 10 billion edges.
NFR2: API response latency p99 under 100ms for read queries.
NFR3: Availability target 99.9% uptime (about 8.77 hours downtime/year).
NFR4: Data storage must be cost-effective and scalable.
NFR5: Support horizontal scaling for both reads and writes.
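A quick back-of-the-envelope estimate helps ground NFR1 and NFR4. The per-edge byte count below is an assumption (two 64-bit IDs, a 1-byte type, an 8-byte timestamp), not a specification.

```python
# Capacity estimate for 100M users and 10B edges (assumed record size).
users = 100_000_000
edges = 10_000_000_000

avg_edges_per_user = edges / users          # 100 connections on average
edge_bytes = 8 + 8 + 1 + 8                  # two ids + type + timestamp (assumption)
raw_storage_gb = edges * edge_bytes / 1e9   # ~250 GB raw, before indexes and replication
```

With 3x replication plus index overhead, plan for roughly a terabyte of edge storage; this fits comfortably on a modest sharded cluster, so the real pressure is on query latency, not raw capacity.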
Think Before You Design
Questions to Ask
❓ Are connections directed (follows) or undirected (friendships), or both?
❓ What is the expected read/write ratio?
❓ How fresh must connection data be; is a briefly stale friend list acceptable?
❓ Are multi-hop queries (friends-of-friends) in scope, or only direct and mutual connections?
❓ What is the distribution of connections per user; are there "celebrity" nodes with millions of edges?
❓ Is the system single-region or multi-region?
❓ What are the growth expectations beyond 100 million users?
Key Components
Graph database or specialized graph storage system
Caching layer for frequent queries
API layer for connection management and queries
Replication and backup system
Load balancer for distributing requests
Monitoring and alerting system
Design Patterns
Graph data modeling (adjacency list, edge list)
Sharding and partitioning of graph data
Caching strategies for graph queries
Eventual consistency with conflict resolution
Batch processing for analytics and maintenance
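The first pattern above contrasts two graph encodings. A small sketch makes the trade-off visible: an edge list is trivial to append to, while an adjacency list groups edges by source user, matching the per-user lookup pattern of FR3.

```python
# Edge list: one row per edge -- simple to append, but a full scan
# (or an index) is needed to collect one user's connections.
edge_list = [(1, 2), (1, 3), (2, 3)]

# Adjacency list: edges grouped by source user -- one key lookup
# reaches all of a user's neighbours.
adjacency = {}
for src, dst in edge_list:
    adjacency.setdefault(src, []).append(dst)

# adjacency is now {1: [2, 3], 2: [3]}
```

Graph databases and wide-row stores such as Cassandra effectively persist the adjacency form, with the user ID as the partition key.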
Reference Architecture
Client
  |
  v
Load Balancer
  |
  v
API Servers (REST/gRPC)
  |
  v
Cache Layer (Redis or Memcached)
  |
  v
Graph Storage Cluster (e.g., Neo4j, JanusGraph with Cassandra)
  |
  v
Replication & Backup Storage

Monitoring & Alerting System (parallel)
Components
API Servers
REST/gRPC services
Handle client requests for adding/removing connections and querying social graph
Load Balancer
Nginx or cloud LB
Distribute incoming requests evenly across API servers
Cache Layer
Redis or Memcached
Cache frequent queries like user connections to reduce latency
Graph Storage Cluster
Neo4j or JanusGraph with Cassandra backend
Store user nodes and edges representing connections; support graph queries
Replication & Backup Storage
Cassandra or distributed file system
Ensure data durability and availability across data centers
Monitoring & Alerting System
Prometheus, Grafana
Track system health, latency, errors, and trigger alerts
Request Flow
1. Client sends request to add/remove connection or query connections.
2. Load balancer routes request to one of the API servers.
3. API server checks cache for query requests; if cache miss, queries graph storage.
4. For write requests, API server updates graph storage and invalidates relevant cache entries.
5. Graph storage persists changes and replicates data asynchronously.
6. API server returns response to client with connection data or confirmation.
7. Monitoring system collects metrics from all components continuously.
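Steps 3 and 4 above describe a cache-aside pattern: read through the cache on queries, and invalidate on writes. A minimal sketch, with plain dicts standing in for Redis and the graph storage cluster:

```python
cache = {}    # stand-in for Redis: user_id -> cached connection set
storage = {}  # stand-in for the graph storage cluster

def get_connections(user_id):
    # Step 3: serve from cache; on a miss, query storage and populate.
    if user_id in cache:
        return cache[user_id]
    result = storage.get(user_id, set())
    cache[user_id] = result
    return result

def add_connection(a, b):
    # Step 4: write to storage first, then invalidate both users'
    # cache entries so the next read repopulates fresh data.
    storage.setdefault(a, set()).add(b)
    storage.setdefault(b, set()).add(a)
    cache.pop(a, None)
    cache.pop(b, None)
```

Invalidating rather than updating the cache entry avoids write races at the cost of one extra storage read after each write.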
Database Schema
Entities:
- User: user_id (PK), name, profile_data
- Connection: from_user_id (FK to User), to_user_id (FK to User), connection_type, created_at

Relationships:
- User to Connection is 1:N (one user can have many connections)
- Connections can be directed (follower) or undirected (friendship stored as two directed edges)

Indexes:
- Index on from_user_id for fast lookup of a user's connections
- Composite index on (from_user_id, to_user_id) for quick existence checks

Storage:
- Use an adjacency list model storing edges per user node for efficient traversal
Scaling Discussion
Bottlenecks
Graph storage cluster can become slow for supernodes, users with a very large number of edges (e.g., celebrity accounts).
Cache invalidation complexity increases with frequent writes.
Load balancer and API servers can become overwhelmed with high request volume.
Replication lag can cause stale reads in multi-region setups.
Solutions
Shard graph data by user_id ranges or hash to distribute load across multiple storage nodes.
Use a cache-aside strategy with TTLs and selective invalidation on writes to balance freshness and performance.
Auto-scale API servers and load balancers based on traffic patterns.
Implement multi-master replication with conflict resolution or use read-your-writes consistency models.
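The first solution, sharding by user_id hash, can be sketched in a few lines. The shard count and the use of md5 are illustrative choices; md5 is used only because Python's built-in `hash()` is randomized per process and therefore unstable across restarts.

```python
import hashlib

NUM_SHARDS = 16  # illustrative; production systems pick this per capacity plan

def shard_for(user_id):
    """Map a user (and their whole adjacency list) to a stable shard."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

Note the trade-off: co-locating each user's adjacency list makes FR3 a single-shard read, but the two endpoints of an edge usually land on different shards, so mutual-connection queries and edge writes may fan out. Consistent hashing (a hash ring) reduces data movement when shards are added or removed.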
Interview Tips
Time: Spend 10 minutes clarifying requirements and constraints, 20 minutes designing architecture and data model, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Clarify types of connections and query patterns before designing.
Explain choice of graph database and caching for performance.
Discuss data model focusing on adjacency list for efficient queries.
Highlight how to handle scale with sharding and replication.
Mention trade-offs between consistency and availability.
Show awareness of monitoring and operational concerns.