
Centralized logging (ELK stack) in Microservices - Scalability & System Analysis

Scalability Analysis - Centralized logging (ELK stack)
Growth Table: Centralized Logging with ELK Stack
| Users / Services | Log Volume | Infrastructure Changes | Challenges |
| --- | --- | --- | --- |
| 100 users / 10 services | ~10K logs/day | Single ELK stack instance; basic log shipping | Minimal latency; easy to manage |
| 10K users / 100 services | ~1M logs/day | Scale Elasticsearch cluster; add Logstash nodes; use Kafka for buffering | Indexing delays; storage growth; query slowdowns |
| 1M users / 1000 services | ~100M logs/day | Multi-node Elasticsearch clusters with sharding; dedicated Kafka clusters; Elasticsearch cross-cluster search | Storage cost; query performance; cluster management complexity |
| 100M users / 10K services | ~10B logs/day | Multiple ELK clusters per region; heavy data tiering and archival; advanced indexing strategies; cloud storage for cold data | High operational cost; data retention policies; disaster recovery |
First Bottleneck

The first bottleneck is usually the Elasticsearch cluster. As log volume grows, Elasticsearch struggles with indexing speed and query latency due to disk I/O and CPU limits.

Scaling Solutions
  • Horizontal Scaling: Add more Elasticsearch nodes and shard indices to distribute load.
  • Buffering: Use Kafka or similar message queues to decouple log producers from Elasticsearch ingestion.
  • Caching: Use Elasticsearch query caching and Kibana dashboards caching to reduce repeated query load.
  • Data Tiering: Move older logs to cheaper storage tiers or cold storage to reduce hot cluster load.
  • Index Lifecycle Management: Automate index rollover and deletion to manage storage efficiently.
  • Load Balancing: Distribute incoming log traffic evenly across Logstash or Beats agents.
  • Compression: Compress logs during transport and storage to save bandwidth and disk space.
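The buffering idea above, decoupling log producers from Elasticsearch ingestion, can be sketched as a minimal batching generator. This is an illustrative sketch, not a real Beats/Logstash implementation; the batch size and flush interval are hypothetical tuning knobs.

```python
import time

def batch_logs(log_stream, max_batch=500, max_wait_s=1.0):
    """Group incoming log lines into batches for bulk indexing,
    so producers never block on the Elasticsearch ingest rate."""
    batch, deadline = [], time.monotonic() + max_wait_s
    for line in log_stream:
        batch.append(line)
        # Flush when the batch is full or the wait deadline passes.
        if len(batch) >= max_batch or time.monotonic() >= deadline:
            yield batch
            batch, deadline = [], time.monotonic() + max_wait_s
    if batch:
        yield batch  # flush the remainder on shutdown

# 1,200 logs arrive quickly -> batches of 500, 500, 200
sizes = [len(b) for b in batch_logs(f"log {i}" for i in range(1200))]
```

In production this role is played by Kafka (durable buffering) plus the Elasticsearch `_bulk` API on the consumer side; the same full-or-timeout flush logic applies.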
Back-of-Envelope Cost Analysis
  • At 1M logs/day (~11.6 logs/sec at 1 KB each), raw ingest is only ~12 KB/s, but indexing amplification (replicas, segment merges, indexing multiple fields) can multiply disk writes several-fold, so SSD-backed nodes are still advisable.
  • Storage needed: Assuming 1 KB per log, 1M logs/day = ~1 GB/day; 1 year = ~365 GB.
  • Network bandwidth: 1M logs/day at 1 KB each averages only ~12 KB/s of log-shipping traffic; provision for peak bursts well above this average.
  • CPU: Elasticsearch nodes need multiple cores (8+) for indexing and query processing at medium scale.
  • Memory: Elasticsearch benefits from large heap sizes (16-32 GB) for caching and indexing.
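The estimates above follow from a few lines of arithmetic. A small helper makes the back-of-envelope numbers reproducible for any scale row in the growth table (the 1 KB/log figure is the same assumption used above):

```python
def logging_estimates(logs_per_day, bytes_per_log=1_000):
    """Back-of-envelope sizing for a centralized logging pipeline."""
    logs_per_sec = logs_per_day / 86_400          # seconds per day
    gb_per_day = logs_per_day * bytes_per_log / 1e9
    return {
        "logs_per_sec": round(logs_per_sec, 1),
        "gb_per_day": round(gb_per_day, 1),
        "gb_per_year": round(gb_per_day * 365, 1),
        "bandwidth_kb_per_sec": round(logs_per_sec * bytes_per_log / 1_000, 1),
    }

est = logging_estimates(1_000_000)
# ~11.6 logs/sec, ~1 GB/day, ~365 GB/year, ~11.6 KB/s shipping bandwidth
```

Note these are averages before replication and indexing overhead; a replica factor of 1 roughly doubles the storage figure.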
Interview Tip

Start by explaining the data flow: microservices generate logs → logs are shipped via agents (Beats) → buffered by Kafka → processed by Logstash → stored in Elasticsearch → visualized in Kibana.

Discuss bottlenecks focusing on Elasticsearch indexing and query performance. Then propose scaling solutions like sharding, buffering, and data tiering. Mention cost trade-offs and operational complexity.

Self Check Question

Your Elasticsearch cluster handles 1000 queries per second (QPS). Traffic grows 10x to 10,000 QPS. What do you do first and why?

Answer: Add more Elasticsearch nodes and distribute query load horizontally across more shard copies. Because the primary shard count of an existing index is fixed at creation, raise it via index rollover (new indices with more primaries) and add replicas, which also serve reads. This prevents CPU and disk I/O bottlenecks and maintains query latency.
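A rough capacity check for the answer above, assuming each shard copy (primary or replica) can serve on the order of 100 searches/sec, a made-up planning figure; benchmark your own cluster for the real number:

```python
import math

def shards_needed(target_qps, per_shard_qps=100, replicas=1):
    """Rough primary-shard count to absorb a search load.

    Assumes each shard copy serves ~per_shard_qps searches and that
    replicas also serve reads (so each primary contributes
    1 + replicas copies of query capacity).
    """
    copies_per_primary = 1 + replicas
    return math.ceil(target_qps / (per_shard_qps * copies_per_primary))

# Growing from 1,000 QPS to 10,000 QPS with 1 replica per primary:
# shards_needed(1_000) -> 5 primaries, shards_needed(10_000) -> 50
```

Since primary counts are fixed per index, the extra shards arrive through rollover: new indices are created with the higher count while old ones age out.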

Key Result
Elasticsearch indexing and query performance is the first bottleneck as log volume grows; horizontal scaling with sharding and buffering with Kafka are key to scaling the ELK stack for centralized logging.