| Users / Services | Log Volume | Infrastructure Changes | Challenges |
|---|---|---|---|
| 100 users / 10 services | ~10K logs/day | Single ELK stack instance; basic log shipping | Minimal latency; easy to manage |
| 10K users / 100 services | ~1M logs/day | Scale Elasticsearch cluster; add Logstash nodes; use Kafka for buffering | Indexing delays; storage growth; query slowdowns |
| 1M users / 1000 services | ~100M logs/day | Multi-node Elasticsearch clusters with sharding; dedicated Kafka clusters; use Elasticsearch cross-cluster search | Storage cost; query performance; cluster management complexity |
| 100M users / 10K services | ~10B logs/day | Multiple ELK clusters per region; heavy use of data tiering and archival; advanced indexing strategies; use of cloud storage for cold data | High operational cost; data retention policies; disaster recovery |
Centralized logging (ELK stack) in Microservices - Scalability & System Analysis
The first bottleneck is usually the Elasticsearch cluster. As log volume grows, Elasticsearch struggles with indexing speed and query latency due to disk I/O and CPU limits.
- Horizontal Scaling: Add more Elasticsearch nodes and shard indices to distribute load.
- Buffering: Use Kafka or similar message queues to decouple log producers from Elasticsearch ingestion.
- Caching: Use Elasticsearch query caching and Kibana dashboard caching to reduce the load from repeated queries.
- Data Tiering: Move older logs to cheaper storage tiers or cold storage to reduce hot cluster load.
- Index Lifecycle Management: Automate index rollover and deletion to manage storage efficiently.
- Load Balancing: Distribute incoming log traffic evenly across Logstash nodes, for example by having Beats agents load-balance across several Logstash hosts.
- Compression: Compress logs during transport and storage to save bandwidth and disk space.
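As a rough illustration of the Data Tiering and Index Lifecycle Management items, the sketch below builds an ILM policy body as a plain Python dict and prints it as JSON. The thresholds (`1d`, `50gb`, `7d`, `30d`) and the helper name `build_ilm_policy` are assumptions chosen for illustration, not recommendations from this guide. The printed body is the kind of document that would be sent to Elasticsearch's `_ilm/policy/<name>` endpoint.

```python
import json


def build_ilm_policy(rollover_age="1d", rollover_size="50gb",
                     warm_after="7d", delete_after="30d"):
    """Return an ILM policy body as a dict (all thresholds are illustrative)."""
    return {
        "policy": {
            "phases": {
                # Hot phase: roll over to a new index by age or primary shard size.
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": rollover_age,
                            "max_primary_shard_size": rollover_size,
                        }
                    }
                },
                # Warm phase: older indices are merged down to fewer segments.
                "warm": {
                    "min_age": warm_after,
                    "actions": {"forcemerge": {"max_num_segments": 1}},
                },
                # Delete phase: drop indices past the retention window.
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


policy = build_ilm_policy()
print(json.dumps(policy, indent=2))
```

Keeping the policy as data makes it easy to vary retention per environment. A shorter `delete_after` in staging, for instance, directly reduces storage growth.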
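- Ingest rate: 1M logs/day is ~11.6 logs/sec; at 1 KB per log that is only ~12 KB/s of raw writes. Disk throughput on the order of 100 MB/s becomes relevant near 10B logs/day (~116K logs/sec, ~116 MB/s raw), before replication and indexing overhead.
- Storage needed: Assuming 1 KB per log, 1M logs/day = ~1 GB/day; 1 year = ~365 GB, before replicas and index overhead.
- Network bandwidth: For 1M logs/day at 1 KB per log, average shipping bandwidth is ~12 KB/s (under 0.1 Mbit/s); provision for bursts rather than the average.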
- CPU: Elasticsearch nodes need multiple cores (8+) for indexing and query processing at medium scale.
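- Memory: Elasticsearch benefits from a sizable JVM heap (16 GB up to roughly 32 GB, the point beyond which the JVM can no longer use compressed object pointers), with the remaining RAM left to the operating system's filesystem cache.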
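A quick way to sanity-check capacity figures like these is to derive them from the stated assumptions. The sketch below computes average ingest rate, raw write bandwidth, and storage for each log volume in the scaling table, assuming 1 KB (taken as 1,000 bytes) per log and no replication or indexing overhead. The function name `estimate` and the 365-day retention default are illustrative choices.

```python
SECONDS_PER_DAY = 86_400


def estimate(logs_per_day, bytes_per_log=1_000, retention_days=365):
    """Average-rate estimate; ignores bursts, replicas, and index overhead."""
    logs_per_sec = logs_per_day / SECONDS_PER_DAY
    daily_bytes = logs_per_day * bytes_per_log
    return {
        "logs_per_sec": logs_per_sec,
        "ingest_kb_per_sec": logs_per_sec * bytes_per_log / 1_000,
        "storage_gb_per_day": daily_bytes / 1e9,
        "storage_gb_retained": daily_bytes * retention_days / 1e9,
    }


# Log volumes taken from the scaling table.
for volume in (10_000, 1_000_000, 100_000_000, 10_000_000_000):
    e = estimate(volume)
    print(
        f"{volume:>14,} logs/day: "
        f"{e['logs_per_sec']:>10,.1f} logs/s, "
        f"{e['ingest_kb_per_sec']:>10,.1f} KB/s, "
        f"{e['storage_gb_per_day']:>8,.2f} GB/day, "
        f"{e['storage_gb_retained']:>12,.0f} GB/year"
    )
```

Real clusters need headroom on top of these averages: replicas multiply stored bytes, and traffic peaks can be several times the daily mean.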
Start by explaining the data flow: microservices generate logs → logs are shipped via agents (Beats) → buffered by Kafka → processed by Logstash → stored in Elasticsearch → visualized in Kibana.
Discuss bottlenecks focusing on Elasticsearch indexing and query performance. Then propose scaling solutions like sharding, buffering, and data tiering. Mention cost trade-offs and operational complexity.
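To make the case for the buffering stage concrete, here is a deliberately simplified model of a queue sitting between log producers and a fixed-rate indexer. It is not Kafka itself. The burst pattern and the drain rate of 300 logs per tick are hypothetical numbers chosen only to show how a buffer absorbs a spike and then drains.

```python
def simulate_backlog(arrivals_per_tick, drain_per_tick):
    """Return the buffer depth after each tick, given a fixed drain rate."""
    depth = 0
    history = []
    for arrived in arrivals_per_tick:
        depth += arrived                      # producers append to the buffer
        depth -= min(depth, drain_per_tick)   # indexer consumes at its own pace
        history.append(depth)
    return history


# Hypothetical traffic: steady load with a two-tick burst in the middle.
burst = [100, 100, 900, 900, 100, 100, 100, 100]
history = simulate_backlog(burst, drain_per_tick=300)

print("backlog per tick:", history)
print("peak backlog:", max(history))
```

The peak backlog indicates how much the buffer must hold for the indexer to survive the burst without dropping logs. In an interview, this is the argument for sizing Kafka retention by worst-case lag rather than average throughput.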
Your Elasticsearch cluster handles 1000 queries per second (QPS). Traffic grows 10x to 10,000 QPS. What do you do first and why?
Answer: Scale out first. Add Elasticsearch nodes and spread load across them with more shards: additional replica shards serve more search traffic, while more primary shards spread indexing. This relieves CPU and disk I/O bottlenecks and keeps query latency stable.
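A back-of-the-envelope helper can turn "add more nodes" into a number. The sketch below assumes query load spreads evenly across nodes, which real clusters only approximate. The per-node capacity of 500 QPS and the 30% headroom are hypothetical placeholders that would have to come from load testing your own cluster.

```python
import math


def data_nodes_needed(target_qps, qps_per_node, headroom=0.3):
    """Nodes required if each node is run at (1 - headroom) of its capacity."""
    usable_qps = qps_per_node * (1 - headroom)
    return math.ceil(target_qps / usable_qps)


# Hypothetical measured capacity: 500 QPS per node.
before = data_nodes_needed(1_000, qps_per_node=500)
after = data_nodes_needed(10_000, qps_per_node=500)
print(f"nodes for 1,000 QPS: {before}; nodes for 10,000 QPS: {after}")
```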
Practice
Solution
Step 1: Understand ELK stack components
ELK stands for Elasticsearch (storage), Logstash (processing), and Kibana (visualization), all focused on logs.
Step 2: Identify ELK stack role in microservices
It centralizes logs from many services to one place for easier monitoring and troubleshooting.
Final Answer:
To collect, store, and visualize logs from multiple services in one place -> Option C
Quick Check:
ELK stack = centralized logging [OK]
- Confusing ELK with deployment tools
- Thinking ELK manages databases
- Assuming ELK monitors network traffic
Solution
Step 1: Recall ELK stack components
Elasticsearch stores logs, Logstash processes, Kibana visualizes, Filebeat ships logs.
Step 2: Identify correct service name in Docker Compose
The service running Elasticsearch is named "elasticsearch" in Docker Compose files.
Final Answer:
elasticsearch -> Option A
Quick Check:
Elasticsearch service = elasticsearch [OK]
- Confusing Logstash or Kibana as Elasticsearch service
- Using 'filebeat' as ELK core service
- Misspelling service names
input { beats { port => 5044 } } output { elasticsearch { hosts => ["http://elasticsearch:9200"] } }

What happens when Logstash receives logs on port 5044?
Solution
Step 1: Analyze Logstash input configuration
Logstash listens for logs from Beats agents on port 5044.
Step 2: Analyze Logstash output configuration
Logs received are forwarded to Elasticsearch at the specified host and port.
Final Answer:
Logs are sent to Elasticsearch at http://elasticsearch:9200 -> Option B
Quick Check:
Logstash input port 5044 forwards logs to Elasticsearch [OK]
- Assuming logs go directly to Kibana
- Thinking port 5044 is invalid
- Believing logs are stored locally on Logstash
Solution
Step 1: Check connectivity between Logstash and Elasticsearch
If Elasticsearch is down or unreachable, Logstash cannot send logs to it.
Step 2: Verify other options
Kibana not running or missing does not stop logs from reaching Elasticsearch; wrong input port would prevent Logstash from receiving logs, not sending.
Final Answer:
Elasticsearch service is down or unreachable -> Option D
Quick Check:
Logs missing usually means Elasticsearch unreachable [OK]
- Blaming Kibana for missing logs in Elasticsearch
- Confusing input port with Elasticsearch port
- Ignoring Elasticsearch service health
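When logs go missing, a first check is whether the relevant ports accept TCP connections at all. The helper below is a minimal sketch using only the Python standard library. It confirms that a port is open, not that the service behind it is healthy. The hostname `elasticsearch` and port 9200 come from the Logstash output shown earlier; any other host you pass in is your own assumption about the deployment.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def report(targets):
    """Build one status line per (label, host, port) target."""
    lines = []
    for label, host, port in targets:
        status = "reachable" if is_reachable(host, port) else "UNREACHABLE"
        lines.append(f"{label} at {host}:{port} is {status}")
    return lines
```

Run from the Logstash host or container, `report([("Elasticsearch", "elasticsearch", 9200)])` would quickly show whether the output side of the pipeline can connect. Checking port 5044 on the Logstash host from a Filebeat machine covers the input side.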
Solution
Step 1: Set up Filebeat on the microservice host
Filebeat collects logs locally and forwards them to Logstash on port 5044.
Step 2: Ensure ELK stack components are running
Logstash processes logs, sends them to Elasticsearch, and Kibana visualizes them.
Final Answer:
Install Filebeat on the microservice host, configure it to send logs to Logstash on port 5044, and verify Elasticsearch and Kibana are running -> Option A
Quick Check:
Filebeat -> Logstash -> Elasticsearch -> Kibana [OK]
- Trying to send logs directly to Kibana
- Expecting Elasticsearch to pull logs automatically
- Running Logstash on microservice host unnecessarily
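On the microservice side, Filebeat needs a log file to collect. One common approach, shown here as a sketch rather than a requirement of the ELK stack, is to write one JSON object per line so that downstream parsing stays simple. The field names, the `JsonLineFormatter` class, and the `build_logger` helper are illustrative assumptions.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": self.service_name,  # lets Kibana filter by service
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service_name, log_path):
    """Create a logger that appends JSON lines to log_path."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(JsonLineFormatter(service_name))
    logger.addHandler(handler)
    return logger
```

A service might call `build_logger("orders", "/var/log/orders/app.log")` at startup (both names are hypothetical) and point Filebeat at that path. Each line then arrives in Logstash already structured.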
