# Centralized Logging (ELK Stack) in Microservices - Scalability & System Analysis

| Users / Services | Log Volume | Infrastructure Changes | Challenges |
|---|---|---|---|
| 100 users / 10 services | ~10K logs/day | Single ELK stack instance; basic log shipping | Few: minimal latency; easy to manage |
| 10K users / 100 services | ~1M logs/day | Scale Elasticsearch cluster; add Logstash nodes; use Kafka for buffering | Indexing delays; storage growth; query slowdowns |
| 1M users / 1000 services | ~100M logs/day | Multi-node Elasticsearch clusters with sharding; dedicated Kafka clusters; use Elasticsearch cross-cluster search | Storage cost; query performance; cluster management complexity |
| 100M users / 10K services | ~10B logs/day | Multiple ELK clusters per region; heavy use of data tiering and archival; advanced indexing strategies; use of cloud storage for cold data | High operational cost; data retention policies; disaster recovery |
The first bottleneck is usually the Elasticsearch cluster. As log volume grows, Elasticsearch struggles with indexing speed and query latency due to disk I/O and CPU limits.
- Horizontal Scaling: Add more Elasticsearch nodes and shard indices to distribute load.
- Buffering: Use Kafka or similar message queues to decouple log producers from Elasticsearch ingestion.
- Caching: Use Elasticsearch query caching and Kibana dashboards caching to reduce repeated query load.
- Data Tiering: Move older logs to cheaper storage tiers or cold storage to reduce hot cluster load.
- Index Lifecycle Management: Automate index rollover and deletion to manage storage efficiently.
- Load Balancing: Distribute incoming log traffic evenly across Logstash instances (Beats shippers can load-balance across multiple Logstash outputs).
- Compression: Compress logs during transport and storage to save bandwidth and disk space.
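To make the sharding strategy concrete: Elasticsearch routes each document to a primary shard by hashing a routing key (the document ID by default) modulo the primary shard count, which is why adding shards spreads indexing load. The sketch below illustrates the idea; note that Elasticsearch actually uses Murmur3, while `md5` here is a standard-library stand-in, and all names are illustrative.

```python
import hashlib

NUM_PRIMARY_SHARDS = 5  # fixed at index creation in Elasticsearch

def shard_for(doc_id: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Pick a shard the way Elasticsearch does conceptually:
    hash the routing key, then take it modulo the shard count."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Distribute 10,000 synthetic log IDs and check the spread.
counts = [0] * NUM_PRIMARY_SHARDS
for i in range(10_000):
    counts[shard_for(f"log-{i}")] += 1

print(counts)  # each shard receives roughly 1/5 of the documents
```

Because the shard count appears in the modulus, changing the number of primaries would re-route existing documents, which is exactly why Elasticsearch fixes it at index creation.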
- Throughput: 1M logs/day is only ~12 logs/sec on average (~12 KB/s at 1 KB/log); segment merges and replication amplify disk writes several-fold beyond the raw rate, and peak traffic is typically many times the daily average.
- Storage needed: Assuming 1 KB per log, 1M logs/day = ~1 GB/day; 1 year = ~365 GB, before replicas and indexing overhead.
- Network bandwidth: Average log-shipping traffic at 1M logs/day is ~12 KB/s; budgeting ~1 MB/s leaves headroom for bursts and protocol overhead.
- CPU: Elasticsearch nodes need multiple cores (8+) for indexing and query processing at medium scale.
- Memory: Elasticsearch benefits from large heap sizes (16-32 GB) for caching and indexing.
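The estimates above can be reproduced with a small back-of-envelope calculator. The 1 KB/log figure comes from the list above; the replica count is a hypothetical parameter you would set per index.

```python
def capacity_estimate(logs_per_day: int, bytes_per_log: int = 1024, replicas: int = 1):
    """Back-of-envelope sizing for a centralized logging pipeline."""
    logs_per_sec = logs_per_day / 86_400                      # seconds per day
    raw_gb_per_day = logs_per_day * bytes_per_log / 1024**3   # primaries only
    stored_gb_per_day = raw_gb_per_day * (1 + replicas)       # primaries + replica copies
    return {
        "logs_per_sec": round(logs_per_sec, 1),
        "raw_gb_per_day": round(raw_gb_per_day, 2),
        "stored_gb_per_year": round(stored_gb_per_day * 365, 1),
    }

print(capacity_estimate(1_000_000))
# ~12 logs/sec, ~1 GB/day raw, ~700 GB/year once one replica copy is included
```

Running the same function with `logs_per_day=100_000_000` shows why the 1M-user row in the table needs tiering: yearly storage jumps into the tens of terabytes even before indexing overhead.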
Start by explaining the data flow: microservices generate logs → logs are shipped via agents (Beats) → buffered by Kafka → processed by Logstash → stored in Elasticsearch → visualized in Kibana.
Discuss bottlenecks focusing on Elasticsearch indexing and query performance. Then propose scaling solutions like sharding, buffering, and data tiering. Mention cost trade-offs and operational complexity.
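The data flow above can be sketched as a toy pipeline: a bounded in-memory queue stands in for Kafka (decoupling producers from ingestion) and a dict stands in for the Elasticsearch index. All names are illustrative.

```python
import queue
import threading

buffer = queue.Queue(maxsize=1000)  # Kafka stand-in: bounded, absorbs bursts
index = {}                          # Elasticsearch stand-in
DONE = object()                     # sentinel to stop the consumer

def microservice(name: str, n: int):
    """Producer: a service emitting log lines into the buffer."""
    for i in range(n):
        buffer.put((f"{name}-{i}", f"log line {i} from {name}"))

def ingester():
    """Consumer: drains the buffer and 'indexes' each log document."""
    while True:
        item = buffer.get()
        if item is DONE:
            break
        doc_id, line = item
        index[doc_id] = line

consumer = threading.Thread(target=ingester)
consumer.start()
producers = [threading.Thread(target=microservice, args=(svc, 100))
             for svc in ("auth", "billing", "search")]
for p in producers:
    p.start()
for p in producers:
    p.join()
buffer.put(DONE)
consumer.join()

print(len(index))  # 3 services x 100 logs each = 300
```

The key property to point out in an interview: when the indexer slows down, the bounded buffer applies backpressure to producers instead of losing logs, which is exactly the role Kafka plays in front of Elasticsearch.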
Your Elasticsearch cluster handles 1000 queries per second (QPS). Traffic grows 10x to 10,000 QPS. What do you do first and why?
Answer: Scale horizontally first: add Elasticsearch data nodes and raise the replica count so query load spreads across more shard copies. (Primary shard counts are fixed at index creation, but new indices created via rollover can use more primaries.) This relieves per-node CPU and disk I/O pressure and keeps query latency stable as QPS grows.
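A quick way to size that first scaling step is to divide the target QPS by the measured per-node capacity, with headroom for failover and bursts. The per-node figure below is hypothetical; you would benchmark it on your own hardware.

```python
import math

def nodes_needed(target_qps: int, per_node_qps: int, headroom: float = 0.7) -> int:
    """Nodes required so each runs at no more than `headroom` of its capacity."""
    return math.ceil(target_qps / (per_node_qps * headroom))

# If one node comfortably sustains ~500 QPS, 10,000 QPS needs:
print(nodes_needed(10_000, 500))  # 29 nodes at 70% utilization
```

Keeping utilization below 100% matters in practice: a cluster sized exactly to its load has no spare capacity to absorb a node failure or a traffic spike.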