
Social graph storage in HLD - System Design Exercise

Design: Social Graph Storage System
The design focuses on storing and querying social graph connections. User authentication, UI, and recommendation algorithms are out of scope.
Functional Requirements
FR1: Store user profiles and their connections (friends, followers).
FR2: Support adding and removing connections between users.
FR3: Efficiently query direct connections of a user.
FR4: Support querying mutual connections between two users.
FR5: Handle up to 100 million users and 10 billion connections.
FR6: Provide low latency queries (p99 < 100ms) for connection lookups.
FR7: Ensure data consistency for connection updates.
FR8: Support eventual consistency for large-scale replication.
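FR3 and FR4 can be made concrete with a minimal in-memory sketch: connections as adjacency sets, with mutual connections computed by set intersection. The class and method names here are illustrative, not part of any specified API.

```python
class SocialGraph:
    """Toy adjacency-set model of the social graph (illustration only)."""

    def __init__(self):
        self.adjacency = {}  # user_id -> set of connected user_ids

    def add_connection(self, a, b):
        # Friendship stored as two directed edges (undirected semantics).
        self.adjacency.setdefault(a, set()).add(b)
        self.adjacency.setdefault(b, set()).add(a)

    def remove_connection(self, a, b):
        self.adjacency.get(a, set()).discard(b)
        self.adjacency.get(b, set()).discard(a)

    def connections(self, user):
        # FR3: direct connections of a user.
        return self.adjacency.get(user, set())

    def mutual_connections(self, a, b):
        # FR4: set intersection, O(min(|A|, |B|)) per lookup.
        return self.connections(a) & self.connections(b)
```

At scale the same intersection pattern applies, but each adjacency set lives in the graph storage cluster rather than a Python dict.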
Non-Functional Requirements
NFR1: Scale to 100 million users and 10 billion edges.
NFR2: API response latency p99 under 100ms for read queries.
NFR3: Availability target 99.9% uptime (about 8.77 hours downtime/year).
NFR4: Data storage must be cost-effective and scalable.
NFR5: Support horizontal scaling for both reads and writes.
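A quick back-of-the-envelope estimate helps ground NFR1 and NFR4. The per-edge byte count below is an assumption (two 64-bit IDs, a 1-byte type, an 8-byte timestamp), not a specification.

```python
# Capacity estimate for 100M users and 10B edges (assumed record size).
users = 100_000_000
edges = 10_000_000_000

avg_edges_per_user = edges / users          # 100 connections on average
edge_bytes = 8 + 8 + 1 + 8                  # two ids + type + timestamp (assumption)
raw_storage_gb = edges * edge_bytes / 1e9   # ~250 GB raw, before indexes and replication
```

With 3x replication plus index overhead, plan for roughly a terabyte of edge storage; this fits comfortably on a modest sharded cluster, so the real pressure is on query latency, not raw capacity.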
Think Before You Design
Questions to Ask
❓ Are connections directed (follows) or undirected (friendships), or both?
❓ What is the expected read/write ratio?
❓ How fresh must connection data be; is a briefly stale friend list acceptable?
❓ Are multi-hop queries (friends-of-friends) in scope, or only direct and mutual connections?
❓ What is the distribution of connections per user; are there "celebrity" nodes with millions of edges?
❓ Is the system single-region or multi-region?
❓ What are the growth expectations beyond 100 million users?
Key Components
Graph database or specialized graph storage system
Caching layer for frequent queries
API layer for connection management and queries
Replication and backup system
Load balancer for distributing requests
Monitoring and alerting system
Design Patterns
Graph data modeling (adjacency list, edge list)
Sharding and partitioning of graph data
Caching strategies for graph queries
Eventual consistency with conflict resolution
Batch processing for analytics and maintenance
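The first pattern above contrasts two graph encodings. A small sketch makes the trade-off visible: an edge list is trivial to append to, while an adjacency list groups edges by source user, matching the per-user lookup pattern of FR3.

```python
# Edge list: one row per edge -- simple to append, but a full scan
# (or an index) is needed to collect one user's connections.
edge_list = [(1, 2), (1, 3), (2, 3)]

# Adjacency list: edges grouped by source user -- one key lookup
# reaches all of a user's neighbours.
adjacency = {}
for src, dst in edge_list:
    adjacency.setdefault(src, []).append(dst)

# adjacency is now {1: [2, 3], 2: [3]}
```

Graph databases and wide-row stores such as Cassandra effectively persist the adjacency form, with the user ID as the partition key.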
Reference Architecture
Client
  |
  v
Load Balancer
  |
  v
API Servers (REST/gRPC)
  |
  v
Cache Layer (Redis or Memcached)
  |
  v
Graph Storage Cluster (e.g., Neo4j, JanusGraph with Cassandra)
  |
  v
Replication & Backup Storage

Monitoring & Alerting System (parallel)
Components
API Servers
REST/gRPC services
Handle client requests for adding/removing connections and querying social graph
Load Balancer
Nginx or cloud LB
Distribute incoming requests evenly across API servers
Cache Layer
Redis or Memcached
Cache frequent queries like user connections to reduce latency
Graph Storage Cluster
Neo4j or JanusGraph with Cassandra backend
Store user nodes and edges representing connections; support graph queries
Replication & Backup Storage
Cassandra or distributed file system
Ensure data durability and availability across data centers
Monitoring & Alerting System
Prometheus, Grafana
Track system health, latency, errors, and trigger alerts
Request Flow
1. Client sends request to add/remove connection or query connections.
2. Load balancer routes request to one of the API servers.
3. API server checks cache for query requests; if cache miss, queries graph storage.
4. For write requests, API server updates graph storage and invalidates relevant cache entries.
5. Graph storage persists changes and replicates data asynchronously.
6. API server returns response to client with connection data or confirmation.
7. Monitoring system collects metrics from all components continuously.
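Steps 3 and 4 above describe a cache-aside pattern: read through the cache on queries, and invalidate on writes. A minimal sketch, with plain dicts standing in for Redis and the graph storage cluster:

```python
cache = {}    # stand-in for Redis: user_id -> cached connection set
storage = {}  # stand-in for the graph storage cluster

def get_connections(user_id):
    # Step 3: serve from cache; on a miss, query storage and populate.
    if user_id in cache:
        return cache[user_id]
    result = storage.get(user_id, set())
    cache[user_id] = result
    return result

def add_connection(a, b):
    # Step 4: write to storage first, then invalidate both users'
    # cache entries so the next read repopulates fresh data.
    storage.setdefault(a, set()).add(b)
    storage.setdefault(b, set()).add(a)
    cache.pop(a, None)
    cache.pop(b, None)
```

Invalidating rather than updating the cache entry avoids write races at the cost of one extra storage read after each write.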
Database Schema
Entities:
- User: user_id (PK), name, profile_data
- Connection: from_user_id (FK to User), to_user_id (FK to User), connection_type, created_at

Relationships:
- User to Connection is 1:N (one user can have many connections)
- Connections can be directed (follower) or undirected (friendship stored as two directed edges)

Indexes:
- Index on from_user_id for fast lookup of a user's connections
- Composite index on (from_user_id, to_user_id) for quick existence checks

Storage:
- Use an adjacency list model storing edges per user node for efficient traversal
Scaling Discussion
Bottlenecks
Graph storage cluster can become slow for supernodes, users with a very large number of edges (e.g., celebrity accounts).
Cache invalidation complexity increases with frequent writes.
Load balancer and API servers can become overwhelmed with high request volume.
Replication lag can cause stale reads in multi-region setups.
Solutions
Shard graph data by user_id ranges or hash to distribute load across multiple storage nodes.
Use a cache-aside strategy with TTLs and selective invalidation on writes to balance freshness and performance.
Auto-scale API servers and load balancers based on traffic patterns.
Implement multi-master replication with conflict resolution or use read-your-writes consistency models.
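The first solution, sharding by user_id hash, can be sketched in a few lines. The shard count and the use of md5 are illustrative choices; md5 is used only because Python's built-in `hash()` is randomized per process and therefore unstable across restarts.

```python
import hashlib

NUM_SHARDS = 16  # illustrative; production systems pick this per capacity plan

def shard_for(user_id):
    """Map a user (and their whole adjacency list) to a stable shard."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

Note the trade-off: co-locating each user's adjacency list makes FR3 a single-shard read, but the two endpoints of an edge usually land on different shards, so mutual-connection queries and edge writes may fan out. Consistent hashing (a hash ring) reduces data movement when shards are added or removed.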
Interview Tips
Time: Spend 10 minutes clarifying requirements and constraints, 20 minutes designing architecture and data model, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing.
Clarify types of connections and query patterns before designing.
Explain choice of graph database and caching for performance.
Discuss data model focusing on adjacency list for efficient queries.
Highlight how to handle scale with sharding and replication.
Mention trade-offs between consistency and availability.
Show awareness of monitoring and operational concerns.