
Online presence system in HLD - Scalability & System Analysis

Scalability Analysis - Online presence system
Growth Table: Online Presence System
| Scale | Users | Active Connections | Data Stored | Traffic Characteristics | System Changes |
|---|---|---|---|---|---|
| Small | 100 | ~100 concurrent | MBs (presence states) | Low, few updates per second | Single server, simple DB, no caching |
| Medium | 10,000 | ~5,000 concurrent | GBs (presence logs, user states) | Moderate, frequent presence updates | Load balancer, DB replicas, caching layer |
| Large | 1,000,000 | ~500,000 concurrent | TBs (history, analytics) | High, real-time updates, many events/sec | Horizontal scaling, sharding, pub/sub messaging |
| Very Large | 100,000,000 | ~50,000,000 concurrent | Petabytes (long-term storage) | Very high, global distribution, multi-region | Global CDN, geo-sharding, multi-region DB, edge caching |
First Bottleneck

At small scale, the database is the first bottleneck because it must absorb frequent presence updates and queries. As the user count grows, the database's write throughput and connection limits are stressed before any other component.
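A minimal sketch of why this happens (hypothetical handler name; SQLite stands in for the presence database): in the naive design, every client heartbeat becomes a direct database write, so write load scales linearly with concurrent connections.

```python
import sqlite3
import time

# SQLite stands in for the presence database in this sketch.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE presence (user_id TEXT PRIMARY KEY, status TEXT, last_seen REAL)")

def on_heartbeat(user_id: str, status: str = "online") -> None:
    """Naive write path: every client heartbeat is a direct DB write.

    At one heartbeat per connection every 10 seconds, 500K concurrent
    users means ~50K writes/sec hitting this statement.
    """
    db.execute(
        "INSERT INTO presence (user_id, status, last_seen) VALUES (?, ?, ?) "
        "ON CONFLICT(user_id) DO UPDATE SET status = excluded.status, "
        "last_seen = excluded.last_seen",
        (user_id, status, time.time()),
    )
    db.commit()

on_heartbeat("alice")
on_heartbeat("alice", "away")
print(db.execute("SELECT status FROM presence WHERE user_id = 'alice'").fetchone()[0])
```

The upsert keeps one row per user, but the per-heartbeat commit is exactly the load that exhausts DB write throughput first.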

Scaling Solutions
  • Database scaling: Use read replicas to offload reads, connection pooling, and write sharding to distribute load.
  • Caching: Cache presence states in fast in-memory stores like Redis to reduce DB hits.
  • Horizontal scaling: Add more application servers behind load balancers to handle concurrent connections.
  • Messaging: Use pub/sub systems (e.g., Kafka, Redis Streams) to propagate presence updates efficiently.
  • Global distribution: Geo-shard data and use CDNs or edge caches to reduce latency for worldwide users.
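The caching point above can be sketched as a cache-aside presence store with short TTLs (a plain dict stands in for Redis here; `db_get_status` is a hypothetical placeholder for the authoritative DB lookup):

```python
import time

# A dict with expiry timestamps stands in for Redis in this sketch.
cache: dict[str, tuple[str, float]] = {}
CACHE_TTL = 30.0  # presence goes stale quickly, so keep TTLs short

def db_get_status(user_id: str) -> str:
    # Placeholder for the authoritative database lookup.
    return "offline"

def set_status(user_id: str, status: str) -> None:
    # Write-through: update the cache so readers never touch the DB.
    cache[user_id] = (status, time.time() + CACHE_TTL)

def get_status(user_id: str) -> str:
    # Cache-aside read: serve from memory, fall back to the DB on a miss.
    entry = cache.get(user_id)
    if entry and entry[1] > time.time():
        return entry[0]
    status = db_get_status(user_id)
    cache[user_id] = (status, time.time() + CACHE_TTL)
    return status

set_status("alice", "online")
print(get_status("alice"))  # served from cache
print(get_status("bob"))    # miss -> DB fallback, then cached
```

Because presence reads vastly outnumber writes, serving them from an in-memory store removes most of the direct DB load.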
Back-of-Envelope Cost Analysis
  • Requests per second: For 1M users with 50% active, ~500K concurrent connections, each sending updates every 10 seconds -> ~50K QPS.
  • Storage: Presence states are small (~100 bytes per user), but history logs grow fast. For 1M users, daily logs ~10GB; yearly ~3.6TB.
  • Bandwidth: Each update ~100 bytes, 50K QPS -> ~5MB/s (~40Mbps), manageable with 1Gbps network.
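The estimates above can be checked with a few lines of arithmetic (assumptions made explicit: 50% of 1M users active, one ~100-byte update per connection every 10 seconds, and ~10 KB of state-change logs per user per day):

```python
# Back-of-envelope check of the numbers above.
users = 1_000_000
active = users // 2                           # ~500K concurrent connections
qps = active / 10                             # one update per 10s -> 50K QPS
update_bytes = 100
bandwidth_mbps = qps * update_bytes * 8 / 1e6 # bytes/s -> megabits/s

log_bytes_per_user_day = 10_000               # assumed ~10 KB of logs/user/day
daily_log_gb = users * log_bytes_per_user_day / 1e9
yearly_log_tb = daily_log_gb * 365 / 1e3

print(f"{qps:,.0f} QPS, ~{bandwidth_mbps:.0f} Mbps, "
      f"{daily_log_gb:.0f} GB/day, {yearly_log_tb:.2f} TB/year")
```

This reproduces the ~50K QPS, ~40 Mbps, ~10 GB/day, and ~3.6 TB/year figures, and makes it easy to re-run the estimate for other scales.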
Interview Tip

Start by defining the scale and key metrics (users, connections, update frequency). Identify the first bottleneck (usually DB). Then discuss scaling strategies step-by-step: caching, read replicas, horizontal scaling, messaging, and global distribution. Always justify why each solution fits the bottleneck.

Self Check

Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?

Answer: Add read replicas and implement caching to reduce direct DB load before considering sharding or adding more servers.
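The "replicas first" answer can be sketched as a read/write split at the application layer (hypothetical connection names; round-robin across replicas):

```python
import itertools

# Hypothetical connection identifiers; a real router would hold DB handles.
PRIMARY = "primary-db"
REPLICAS = ["replica-1", "replica-2", "replica-3"]
_replica_cycle = itertools.cycle(REPLICAS)

def route(query: str) -> str:
    """Send writes to the primary, spread reads across replicas."""
    verb = query.lstrip().split(None, 1)[0].upper()
    is_write = verb in {"INSERT", "UPDATE", "DELETE"}
    return PRIMARY if is_write else next(_replica_cycle)

print(route("UPDATE presence SET status = 'online'"))  # primary-db
print(route("SELECT status FROM presence"))            # replica-1
print(route("SELECT status FROM presence"))            # replica-2
```

Since presence workloads are read-heavy, this splits most of the 10x traffic across replicas while the primary keeps handling the (much smaller) write volume.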

Key Result
The database is the first bottleneck as user count and presence updates grow; scaling requires caching, read replicas, and horizontal scaling of app servers to handle real-time presence efficiently.