
Group messaging in HLD - Scalability & System Analysis

Scalability Analysis - Group messaging
Growth Table: Group Messaging System
| Users | Messages/Day | Active Groups | Storage Size | Server Load | Network Traffic |
|---|---|---|---|---|---|
| 100 | 10K | 50 | ~100 MB | 1 app server | Low |
| 10,000 | 1M | 5,000 | ~10 GB | 3-5 app servers | Moderate |
| 1,000,000 | 100M | 500,000 | ~1 TB | 50+ app servers, DB cluster | High |
| 100,000,000 | 10B | 50M | ~100+ TB | Hundreds of servers, sharded DB | Very High |
First Bottleneck

At small scale (up to 10K users), database write throughput is the first bottleneck, because every message must be stored reliably. A single database instance can typically handle around 5,000-10,000 writes per second, so write latency degrades as message volume approaches that limit.

At medium scale (100K+ users), the application servers' CPU and memory become the bottleneck due to message fan-out (delivering each message to every group member).

At large scale (millions of users), network bandwidth and storage size become bottlenecks, requiring data partitioning and efficient delivery mechanisms.
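The bottleneck progression above can be sanity-checked numerically. A minimal sketch (the capacity figure and `message_qps` helper are illustrative assumptions, not measurements):

```python
# Illustrative check of when raw message volume exceeds a single
# database's write capacity (the 5,000-10,000 writes/sec range above).
DB_WRITE_CAPACITY_QPS = 5_000  # conservative end of the assumed range

def message_qps(users: int, messages_per_user_per_day: int) -> float:
    """Average message write rate in writes/sec, ignoring peak-hour skew."""
    return users * messages_per_user_per_day / 86_400

for users in (100, 10_000, 1_000_000, 100_000_000):
    qps = message_qps(users, 100)
    status = "OK" if qps <= DB_WRITE_CAPACITY_QPS else "BOTTLENECK"
    print(f"{users:>11} users -> {qps:>10.1f} writes/sec  {status}")
```

Note that this uses the daily average; real traffic peaks 2-5x above average, so the bottleneck arrives earlier than the average suggests.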

Scaling Solutions
  • Database scaling: Use read replicas for reads, write sharding by group ID to distribute writes.
  • Caching: Cache recent messages per group in Redis to reduce DB reads.
  • Horizontal scaling: Add more app servers behind load balancers to handle concurrent connections and message fan-out.
  • Message queue: Use message brokers (e.g., Kafka) to decouple message ingestion and delivery.
  • CDN and push notifications: Use CDN for media content and push notifications for offline users.
  • Data archiving: Archive old messages to cheaper storage to reduce DB size.
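Of these, write sharding by group ID is the least obvious mechanically. A minimal sketch, assuming a fixed pool of shards (`NUM_SHARDS` and `shard_for_group` are hypothetical names, not a real library API):

```python
import hashlib

# Assumed fixed shard pool; real systems often use consistent hashing
# so that adding shards does not remap every group.
NUM_SHARDS = 8

def shard_for_group(group_id: str) -> int:
    """Map a group ID to a shard deterministically, so every message
    for a given group lands on the same database instance."""
    digest = hashlib.md5(group_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

Hashing by group ID keeps a group's full history on one shard (cheap history reads), but modulo hashing makes resharding expensive; consistent hashing is the usual mitigation.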
Back-of-Envelope Cost Analysis

Assuming 1M users sending 100 messages/day:

  • Messages per second (QPS): ~1,000,000 users * 100 messages / 86400 seconds ≈ 1157 QPS
  • Storage: 100 bytes per message * 100M messages/day = ~10 GB/day
  • Network bandwidth: Assuming 1 KB per message delivered to 10 recipients on average = 1157 QPS * 1 KB * 10 = ~11.57 MB/s (~92 Mbps)
  • App servers: Each server handles ~2000 concurrent connections and message fan-out; need ~10-20 servers
  • Database: Must support ~1200 writes/sec and higher reads; use sharding and replicas
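The arithmetic above can be reproduced directly; all inputs are the stated assumptions (100-byte stored messages, 1 KB delivered per recipient, 10 recipients on average):

```python
USERS = 1_000_000
MSGS_PER_USER_PER_DAY = 100
BYTES_PER_MESSAGE = 100         # assumed stored size
DELIVERED_KB_PER_MESSAGE = 1    # assumed wire size per recipient
AVG_RECIPIENTS = 10

messages_per_day = USERS * MSGS_PER_USER_PER_DAY               # 100M messages/day
qps = messages_per_day / 86_400                                 # ~1,157 QPS
storage_gb_per_day = messages_per_day * BYTES_PER_MESSAGE / 1e9 # ~10 GB/day
bandwidth_mb_s = qps * DELIVERED_KB_PER_MESSAGE * AVG_RECIPIENTS / 1000  # ~11.6 MB/s
```

Keeping the calculation in one place like this makes it easy to re-run the whole estimate when an interviewer changes an assumption (e.g. average group size).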
Interview Tip

Start by defining key metrics: users, messages per user, group size. Then identify bottlenecks step-by-step: database writes, message delivery, storage. Discuss scaling strategies for each bottleneck clearly. Use real numbers to justify your choices. Always mention trade-offs and fallback plans.

Self Check

Question: Your database handles 1000 QPS. Traffic grows 10x. What do you do first?

Answer: The first step is to add read replicas to offload read traffic and implement write sharding by group ID to distribute write load across multiple database instances. This prevents the single DB from becoming a bottleneck.
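The read-replica half of that answer amounts to routing reads and writes to different connections. A minimal sketch with stubbed connection objects (`ReplicaRouter` is a hypothetical name, not a real driver API):

```python
import random

class ReplicaRouter:
    """Send writes to the primary; spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def route(self, is_write: bool):
        if is_write or not self.replicas:
            return self.primary
        return random.choice(self.replicas)
```

One trade-off worth mentioning in an interview: replicas lag the primary, so a client may not read its own just-sent message unless read-your-writes consistency is handled explicitly.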

Key Result
The database write throughput is the first bottleneck at small scale; scaling requires sharding and caching. At larger scale, app servers and network bandwidth become bottlenecks, solved by horizontal scaling, message queues, and data partitioning.