
Online presence system in HLD - Scalability & System Analysis

Scalability Analysis - Online presence system
Growth Table: Online Presence System
| Scale | Users | Active Connections | Data Stored | Traffic Characteristics | System Changes |
|---|---|---|---|---|---|
| Small | 100 | ~100 concurrent | MBs (presence states) | Low, few updates per second | Single server, simple DB, no caching |
| Medium | 10,000 | ~5,000 concurrent | GBs (presence logs, user states) | Moderate, frequent presence updates | Load balancer, DB replicas, caching layer |
| Large | 1,000,000 | ~500,000 concurrent | TBs (history, analytics) | High, real-time updates, many events/sec | Horizontal scaling, sharding, pub/sub messaging |
| Very Large | 100,000,000 | ~50,000,000 concurrent | Petabytes (long-term storage) | Very high, global distribution, multi-region | Global CDN, geo-sharding, multi-region DB, edge caching |
First Bottleneck

At small scale, the database is the first bottleneck because it must absorb frequent presence updates and queries. As the user count grows, the database's write throughput and connection limits are stressed before any other component.
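A minimal sketch of why this happens (hypothetical handler name; SQLite stands in for the presence database): in the naive design, every client heartbeat becomes a direct database write, so write load scales linearly with concurrent connections.

```python
import sqlite3
import time

# SQLite stands in for the presence database in this sketch.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE presence (user_id TEXT PRIMARY KEY, status TEXT, last_seen REAL)")

def on_heartbeat(user_id: str, status: str = "online") -> None:
    """Naive write path: every client heartbeat is a direct DB write.

    At one heartbeat per connection every 10 seconds, 500K concurrent
    users means ~50K writes/sec hitting this statement.
    """
    db.execute(
        "INSERT INTO presence (user_id, status, last_seen) VALUES (?, ?, ?) "
        "ON CONFLICT(user_id) DO UPDATE SET status = excluded.status, "
        "last_seen = excluded.last_seen",
        (user_id, status, time.time()),
    )
    db.commit()

on_heartbeat("alice")
on_heartbeat("alice", "away")
print(db.execute("SELECT status FROM presence WHERE user_id = 'alice'").fetchone()[0])
```

The upsert keeps one row per user, but the per-heartbeat commit is exactly the load that exhausts DB write throughput first.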

Scaling Solutions
  • Database scaling: Use read replicas to offload reads, connection pooling, and write sharding to distribute load.
  • Caching: Cache presence states in fast in-memory stores like Redis to reduce DB hits.
  • Horizontal scaling: Add more application servers behind load balancers to handle concurrent connections.
  • Messaging: Use pub/sub systems (e.g., Kafka, Redis Streams) to propagate presence updates efficiently.
  • Global distribution: Geo-shard data and use CDNs or edge caches to reduce latency for worldwide users.
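The caching point above can be sketched as a cache-aside presence store with short TTLs (a plain dict stands in for Redis here; `db_get_status` is a hypothetical placeholder for the authoritative DB lookup):

```python
import time

# A dict with expiry timestamps stands in for Redis in this sketch.
cache: dict[str, tuple[str, float]] = {}
CACHE_TTL = 30.0  # presence goes stale quickly, so keep TTLs short

def db_get_status(user_id: str) -> str:
    # Placeholder for the authoritative database lookup.
    return "offline"

def set_status(user_id: str, status: str) -> None:
    # Write-through: update the cache so readers never touch the DB.
    cache[user_id] = (status, time.time() + CACHE_TTL)

def get_status(user_id: str) -> str:
    # Cache-aside read: serve from memory, fall back to the DB on a miss.
    entry = cache.get(user_id)
    if entry and entry[1] > time.time():
        return entry[0]
    status = db_get_status(user_id)
    cache[user_id] = (status, time.time() + CACHE_TTL)
    return status

set_status("alice", "online")
print(get_status("alice"))  # served from cache
print(get_status("bob"))    # miss -> DB fallback, then cached
```

Because presence reads vastly outnumber writes, serving them from an in-memory store removes most of the direct DB load.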
Back-of-Envelope Cost Analysis
  • Requests per second: For 1M users with 50% active, ~500K concurrent connections, each sending updates every 10 seconds -> ~50K QPS.
  • Storage: Presence states are small (~100 bytes per user), but history logs grow fast. For 1M users, daily logs ~10GB; yearly ~3.6TB.
  • Bandwidth: Each update ~100 bytes, 50K QPS -> ~5MB/s (~40Mbps), manageable with 1Gbps network.
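The estimates above can be checked with a few lines of arithmetic (assumptions made explicit: 50% of 1M users active, one ~100-byte update per connection every 10 seconds, and ~10 KB of state-change logs per user per day):

```python
# Back-of-envelope check of the numbers above.
users = 1_000_000
active = users // 2                           # ~500K concurrent connections
qps = active / 10                             # one update per 10s -> 50K QPS
update_bytes = 100
bandwidth_mbps = qps * update_bytes * 8 / 1e6 # bytes/s -> megabits/s

log_bytes_per_user_day = 10_000               # assumed ~10 KB of logs/user/day
daily_log_gb = users * log_bytes_per_user_day / 1e9
yearly_log_tb = daily_log_gb * 365 / 1e3

print(f"{qps:,.0f} QPS, ~{bandwidth_mbps:.0f} Mbps, "
      f"{daily_log_gb:.0f} GB/day, {yearly_log_tb:.2f} TB/year")
```

This reproduces the ~50K QPS, ~40 Mbps, ~10 GB/day, and ~3.6 TB/year figures, and makes it easy to re-run the estimate for other scales.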
Interview Tip

Start by defining the scale and key metrics (users, connections, update frequency). Identify the first bottleneck (usually DB). Then discuss scaling strategies step-by-step: caching, read replicas, horizontal scaling, messaging, and global distribution. Always justify why each solution fits the bottleneck.

Self Check

Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?

Answer: Add read replicas and implement caching to reduce direct DB load before considering sharding or adding more servers.
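The "replicas first" answer can be sketched as a read/write split at the application layer (hypothetical connection names; round-robin across replicas):

```python
import itertools

# Hypothetical connection identifiers; a real router would hold DB handles.
PRIMARY = "primary-db"
REPLICAS = ["replica-1", "replica-2", "replica-3"]
_replica_cycle = itertools.cycle(REPLICAS)

def route(query: str) -> str:
    """Send writes to the primary, spread reads across replicas."""
    verb = query.lstrip().split(None, 1)[0].upper()
    is_write = verb in {"INSERT", "UPDATE", "DELETE"}
    return PRIMARY if is_write else next(_replica_cycle)

print(route("UPDATE presence SET status = 'online'"))  # primary-db
print(route("SELECT status FROM presence"))            # replica-1
print(route("SELECT status FROM presence"))            # replica-2
```

Since presence workloads are read-heavy, this splits most of the 10x traffic across replicas while the primary keeps handling the (much smaller) write volume.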

Key Result
The database is the first bottleneck as user count and presence updates grow; scaling requires caching, read replicas, and horizontal scaling of app servers to handle real-time presence efficiently.