
Metrics collection in HLD - Scalability & System Analysis

Scalability Analysis - Metrics collection
Growth Table: Metrics Collection System
Scale                  | Users / Devices     | Metrics Volume     | Data Storage                     | Query Load    | System Changes
Small (100 users)      | 100 devices         | ~10K metrics/min   | Single DB instance               | Low QPS       | Simple ingestion, single server
Medium (10K users)     | 10,000 devices      | ~1M metrics/min    | DB with read replicas            | Moderate QPS  | Introduce caching, load balancer
Large (1M users)       | 1,000,000 devices   | ~100M metrics/min  | Sharded DB, distributed storage  | High QPS      | Partitioning, message queues, batch processing
Very Large (100M users)| 100,000,000 devices | ~10B metrics/min   | Multi-region distributed storage | Very high QPS | Advanced sharding, edge processing, CDN for dashboards
First Bottleneck

At small to medium scale, the database is the first bottleneck: it must absorb every write and serve every query, and the ingestion pipeline can overwhelm it with high write QPS. As scale grows, network bandwidth and storage I/O become bottlenecks as well.
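A quick way to see when the database saturates is to compare the incoming write rate against its write capacity. The numbers below are illustrative assumptions (100 metrics per device per minute, a single DB sustaining ~5K writes/sec), not measurements:

```python
DB_WRITE_CAPACITY = 5_000  # assumed sustained writes/sec for one DB instance


def ingest_rate_per_sec(devices: int, metrics_per_device_per_min: int) -> float:
    """Write rate the database must absorb, in writes per second."""
    return devices * metrics_per_device_per_min / 60


def db_saturated(devices: int, metrics_per_device_per_min: int = 100) -> bool:
    """True once the ingest rate exceeds a single DB's write capacity."""
    return ingest_rate_per_sec(devices, metrics_per_device_per_min) > DB_WRITE_CAPACITY


# Small scale (100 devices): ~167 writes/sec -- a single DB copes.
# Medium scale (10K devices): ~16.7K writes/sec -- already saturated.
```

Under these assumptions, saturation arrives between the small and medium rows of the growth table, which is why the scaling steps below start there.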

Scaling Solutions
  • Horizontal scaling: Add more ingestion servers behind a load balancer to handle more incoming metrics.
  • Caching: Use in-memory caches (e.g., Redis) to reduce read load on the database for frequent queries.
  • Sharding: Partition data by device ID or time to distribute load across multiple database instances.
  • Message queues: Use Kafka or similar to buffer and batch writes, smoothing spikes in traffic.
  • Distributed storage: Use scalable time-series databases or cloud storage optimized for large write volumes.
  • Edge processing: Aggregate or filter metrics near the source to reduce data volume sent to central servers.
  • CDN for dashboards: Cache dashboard data closer to users to reduce backend query load.
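The sharding bullet above can be sketched in a few lines. This is a minimal hash-based router, assuming a fixed shard count of 8 and string device IDs (both illustrative):

```python
import hashlib

NUM_SHARDS = 8  # illustrative; a real deployment sizes this from write QPS


def shard_for(device_id: str) -> int:
    """Route a metric to a shard by hashing its device ID.

    Hashing spreads devices evenly across shards; partitioning by time
    instead would keep each shard's data contiguous for range queries,
    at the cost of concentrating current writes on one "hot" shard.
    """
    digest = hashlib.md5(device_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS
```

Because the mapping is deterministic, all metrics from one device always land on the same shard, so per-device queries touch a single database instance.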
Back-of-Envelope Cost Analysis
  • At 1M users sending 100 metrics/minute: 100 million metrics/minute ≈ 1.7 million writes/second.
  • Write throughput: Assuming 200 bytes per metric, 1.7M writes/sec × 200 bytes ≈ 340 MB/s write throughput.
  • Network bandwidth: 340 MB/s ≈ 2.7 Gbps, requiring multiple network interfaces or distributed ingestion.
  • Database QPS: Single DB handles ~5K QPS, so need ~340 DB shards or use specialized time-series DB.
  • Cost scales with number of servers, storage, and network infrastructure.
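The estimates above can be checked with a few lines of arithmetic (the bulleted figures round writes/sec up to 1.7M, so the exact values come out slightly lower):

```python
users = 1_000_000
metrics_per_user_per_min = 100
bytes_per_metric = 200           # assumed average metric size
db_capacity_qps = 5_000          # assumed single-DB write capacity

writes_per_sec = users * metrics_per_user_per_min / 60        # ~1.67M writes/sec
throughput_mb_s = writes_per_sec * bytes_per_metric / 1e6     # ~333 MB/s
bandwidth_gbps = throughput_mb_s * 8 / 1000                   # ~2.7 Gbps
shards_needed = writes_per_sec / db_capacity_qps              # ~333 shards
```

Redoing the arithmetic from first principles like this is worth a sentence in an interview: it shows the shard count and bandwidth both fall directly out of the write rate.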
Interview Tip

Start by clarifying the scale and data volume. Identify the main bottleneck (usually DB writes). Discuss incremental scaling steps: caching, sharding, message queues. Mention trade-offs like consistency vs latency. Show awareness of cost and complexity.

Self Check

Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?

Answer: Introduce read replicas and caching to reduce DB load, and implement message queues to batch writes. Consider sharding data to distribute load across multiple DB instances.
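The caching half of that answer can be sketched as a read-through cache with a TTL. This is a tiny in-process stand-in for something like Redis (class name, TTL, and the `load_from_db` callback are all illustrative):

```python
import time


class ReadThroughCache:
    """Minimal read-through cache with TTL expiry (stand-in for Redis)."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        self._store = {}    # key -> (value, expiry timestamp)
        self.db_reads = 0   # how many requests actually reached the DB

    def get(self, key, load_from_db):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]              # cache hit: DB is untouched
        value = load_from_db(key)      # cache miss: one DB read
        self.db_reads += 1
        self._store[key] = (value, now + self.ttl)
        return value


cache = ReadThroughCache()
for _ in range(1000):
    cache.get("dashboard:cpu", lambda k: "<expensive DB query result>")
# 1000 requests, but only the first one reaches the database.
```

With a 30-second TTL, a dashboard query asked 1000 times in a burst costs the database one read instead of 1000, which is exactly the load reduction the answer relies on.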

Key Result
Metrics collection systems first hit database write bottlenecks as user and data volume grow; scaling requires horizontal ingestion, caching, sharding, and distributed storage.