
Metrics collection in HLD - Scalability & System Analysis

Scalability Analysis - Metrics collection
Growth Table: Metrics Collection System
Scale                  | Users / Devices     | Metrics Volume     | Data Storage                     | Query Load    | System Changes
Small (100 users)      | 100 devices         | ~10K metrics/min   | Single DB instance               | Low QPS       | Simple ingestion, single server
Medium (10K users)     | 10,000 devices      | ~1M metrics/min    | DB with read replicas            | Moderate QPS  | Introduce caching, load balancer
Large (1M users)       | 1,000,000 devices   | ~100M metrics/min  | Sharded DB, distributed storage  | High QPS      | Partitioning, message queues, batch processing
Very Large (100M users)| 100,000,000 devices | ~10B metrics/min   | Multi-region distributed storage | Very high QPS | Advanced sharding, edge processing, CDN for dashboards
First Bottleneck

At small to medium scale, the database is the first bottleneck: it must absorb every write and serve every query, and the ingestion pipeline can overwhelm it with high write QPS. As scale grows, network bandwidth and storage I/O become bottlenecks as well.
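A quick way to see when the database saturates is to compare the incoming write rate against its write capacity. The numbers below are illustrative assumptions (100 metrics per device per minute, a single DB sustaining ~5K writes/sec), not measurements:

```python
DB_WRITE_CAPACITY = 5_000  # assumed sustained writes/sec for one DB instance


def ingest_rate_per_sec(devices: int, metrics_per_device_per_min: int) -> float:
    """Write rate the database must absorb, in writes per second."""
    return devices * metrics_per_device_per_min / 60


def db_saturated(devices: int, metrics_per_device_per_min: int = 100) -> bool:
    """True once the ingest rate exceeds a single DB's write capacity."""
    return ingest_rate_per_sec(devices, metrics_per_device_per_min) > DB_WRITE_CAPACITY


# Small scale (100 devices): ~167 writes/sec -- a single DB copes.
# Medium scale (10K devices): ~16.7K writes/sec -- already saturated.
```

Under these assumptions, saturation arrives between the small and medium rows of the growth table, which is why the scaling steps below start there.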

Scaling Solutions
  • Horizontal scaling: Add more ingestion servers behind a load balancer to handle more incoming metrics.
  • Caching: Use in-memory caches (e.g., Redis) to reduce read load on the database for frequent queries.
  • Sharding: Partition data by device ID or time to distribute load across multiple database instances.
  • Message queues: Use Kafka or similar to buffer and batch writes, smoothing spikes in traffic.
  • Distributed storage: Use scalable time-series databases or cloud storage optimized for large write volumes.
  • Edge processing: Aggregate or filter metrics near the source to reduce data volume sent to central servers.
  • CDN for dashboards: Cache dashboard data closer to users to reduce backend query load.
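The sharding bullet above can be sketched in a few lines. This is a minimal hash-based router, assuming a fixed shard count of 8 and string device IDs (both illustrative):

```python
import hashlib

NUM_SHARDS = 8  # illustrative; a real deployment sizes this from write QPS


def shard_for(device_id: str) -> int:
    """Route a metric to a shard by hashing its device ID.

    Hashing spreads devices evenly across shards; partitioning by time
    instead would keep each shard's data contiguous for range queries,
    at the cost of concentrating current writes on one "hot" shard.
    """
    digest = hashlib.md5(device_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS
```

Because the mapping is deterministic, all metrics from one device always land on the same shard, so per-device queries touch a single database instance.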
Back-of-Envelope Cost Analysis
  • At 1M users sending 100 metrics/minute: 100 million metrics/minute ≈ 1.7 million writes/second.
  • Write throughput: Assuming 200 bytes per metric, 1.7M writes/sec × 200 bytes ≈ 340 MB/s write throughput.
  • Network bandwidth: 340 MB/s ≈ 2.7 Gbps, requiring multiple network interfaces or distributed ingestion.
  • Database QPS: Single DB handles ~5K QPS, so need ~340 DB shards or use specialized time-series DB.
  • Cost scales with number of servers, storage, and network infrastructure.
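The estimates above can be checked with a few lines of arithmetic (the bulleted figures round writes/sec up to 1.7M, so the exact values come out slightly lower):

```python
users = 1_000_000
metrics_per_user_per_min = 100
bytes_per_metric = 200           # assumed average metric size
db_capacity_qps = 5_000          # assumed single-DB write capacity

writes_per_sec = users * metrics_per_user_per_min / 60        # ~1.67M writes/sec
throughput_mb_s = writes_per_sec * bytes_per_metric / 1e6     # ~333 MB/s
bandwidth_gbps = throughput_mb_s * 8 / 1000                   # ~2.7 Gbps
shards_needed = writes_per_sec / db_capacity_qps              # ~333 shards
```

Redoing the arithmetic from first principles like this is worth a sentence in an interview: it shows the shard count and bandwidth both fall directly out of the write rate.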
Interview Tip

Start by clarifying the scale and data volume. Identify the main bottleneck (usually DB writes). Discuss incremental scaling steps: caching, sharding, message queues. Mention trade-offs like consistency vs latency. Show awareness of cost and complexity.

Self Check

Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?

Answer: Introduce read replicas and caching to reduce DB load, and implement message queues to batch writes. Consider sharding data to distribute load across multiple DB instances.
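The caching half of that answer can be sketched as a read-through cache with a TTL. This is a tiny in-process stand-in for something like Redis (class name, TTL, and the `load_from_db` callback are all illustrative):

```python
import time


class ReadThroughCache:
    """Minimal read-through cache with TTL expiry (stand-in for Redis)."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        self._store = {}    # key -> (value, expiry timestamp)
        self.db_reads = 0   # how many requests actually reached the DB

    def get(self, key, load_from_db):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]              # cache hit: DB is untouched
        value = load_from_db(key)      # cache miss: one DB read
        self.db_reads += 1
        self._store[key] = (value, now + self.ttl)
        return value


cache = ReadThroughCache()
for _ in range(1000):
    cache.get("dashboard:cpu", lambda k: "<expensive DB query result>")
# 1000 requests, but only the first one reaches the database.
```

With a 30-second TTL, a dashboard query asked 1000 times in a burst costs the database one read instead of 1000, which is exactly the load reduction the answer relies on.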

Key Result
Metrics collection systems first hit database write bottlenecks as user and data volume grow; scaling requires horizontal ingestion, caching, sharding, and distributed storage.