| Users / Services | What Changes? |
|---|---|
| 100 users / 10 services | Basic mTLS setup with certificates issued by internal CA. Low latency impact. Simple certificate rotation. |
| 10,000 users / 100 services | Certificate management grows complex. Need automated certificate issuance and rotation. Increased CPU usage for TLS handshakes. |
| 1,000,000 users / 1,000+ services | High TLS handshake overhead impacts service latency. Certificate revocation and trust management become challenging. Need centralized certificate management and TLS session caching. |
| 100,000,000 users / 10,000+ services | Network bandwidth and CPU load from TLS dominate. Must implement TLS session resumption, hardware acceleration, and distributed trust stores. Monitoring and alerting become critical. |
Mutual TLS between services in Microservices - Scalability & System Analysis
The first bottleneck is CPU load on service instances from TLS handshakes. Each mutual TLS handshake requires asymmetric cryptographic operations on both sides: key exchange, signing, and verification of the peer's certificate chain. As the number of services and new connections grows, CPU saturates, which increases latency and reduces throughput.
- Session Resumption: Use TLS session tickets or session IDs (PSK-based resumption in TLS 1.3) to avoid full handshakes on repeated connections.
- Connection Pooling: Reuse TLS connections between services to reduce handshake frequency.
- Hardware Acceleration: Use CPUs with crypto acceleration or dedicated TLS offload hardware.
- Centralized Certificate Management: Automate certificate issuance, rotation, and revocation with tools like SPIFFE/SPIRE or Vault.
- Load Balancing: Distribute traffic to avoid CPU hotspots.
- Caching Trust Data: Cache certificate validation results to reduce repeated expensive operations.
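The connection pooling item above can be illustrated with a minimal sketch. The `ConnectionPool` class and its `handshakes` counter are hypothetical names introduced here, and `connect` stands in for whatever call performs the real mTLS handshake in your service (for example, wrapping a socket with an `ssl.SSLContext` that has a client certificate loaded). The point is only that a released connection is handed to the next caller, so the handshake runs once per connection rather than once per request.

```python
import queue
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class ConnectionPool(Generic[T]):
    """Hands out idle connections before opening (and handshaking) new ones.

    Illustrative only: the handshake counter is not synchronized, and there
    is no health checking or idle timeout.
    """

    def __init__(self, connect: Callable[[], T], max_idle: int = 10) -> None:
        self._connect = connect  # assumed to perform the expensive mTLS handshake
        self._idle: "queue.LifoQueue[T]" = queue.LifoQueue(maxsize=max_idle)
        self.handshakes = 0      # how many times connect() was actually called

    def acquire(self) -> T:
        try:
            return self._idle.get_nowait()  # reuse: no new handshake
        except queue.Empty:
            self.handshakes += 1
            return self._connect()

    def release(self, conn: T) -> None:
        try:
            self._idle.put_nowait(conn)     # keep it for the next caller
        except queue.Full:
            close = getattr(conn, "close", None)
            if callable(close):
                close()                     # pool is full: drop the connection


# Stand-in connection factory; a real one would open a TLS socket.
pool = ConnectionPool(connect=lambda: object())
for _ in range(100):
    conn = pool.acquire()
    pool.release(conn)
print(pool.handshakes)  # prints 1: one handshake served 100 sequential requests
```

In practice most HTTP and gRPC clients already pool connections; the sketch is meant to show why keeping those pools warm matters for handshake CPU.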
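- Assume 1,000 concurrent connections per server and ~10-50 ms of CPU time per full mutual TLS handshake. Treat this as a conservative upper bound: with modern ECDSA certificates the CPU cost is often closer to a few milliseconds, while 10-50 ms is more typical of wall-clock handshake latency including round trips. Measure before sizing.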
- At 10,000 services, with 10 handshakes per second each, total TLS handshakes = 100,000/sec.
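- At 10 ms of CPU per handshake, 100,000 handshakes/sec consumes roughly 1,000 CPU cores' worth of compute across the fleet, so handshake load alone can saturate many servers. Horizontal scaling helps, but reducing the handshake rate matters more.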
- Storage for certificates: Each certificate ~2KB, 10,000 services = ~20MB, manageable in memory.
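- Network bandwidth impact: TLS adds ~5-10% overhead on data transferred for small messages; for large payloads the per-record framing overhead is usually lower. Each full handshake also transfers both peers' certificate chains, which adds several KB per new connection.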
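The estimates above can be reproduced with a few lines of arithmetic. This is a rough sketch, not a capacity model: the function names are made up for illustration, and the per-handshake CPU cost is an input you should replace with a measured value.

```python
def handshake_cores(services: int,
                    handshakes_per_service_per_sec: float,
                    cpu_ms_per_handshake: float) -> float:
    """CPU cores kept fully busy by handshakes alone (fleet-wide)."""
    total_per_sec = services * handshakes_per_service_per_sec
    return total_per_sec * cpu_ms_per_handshake / 1000.0


def cert_store_mb(services: int, cert_kb: float = 2.0) -> float:
    """Approximate memory for one certificate per service, in MB."""
    return services * cert_kb / 1000.0


# Figures from the list above: 10,000 services, 10 handshakes/sec each.
print(handshake_cores(10_000, 10, 10))  # 1000.0 cores at 10 ms per handshake
print(handshake_cores(10_000, 10, 50))  # 5000.0 cores at 50 ms per handshake
print(cert_store_mb(10_000))            # 20.0 MB of certificates
```

The spread between the two core counts shows how sensitive the estimate is to the per-handshake cost, which is why measuring it on your own hardware comes first.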
Start by explaining what mutual TLS is and why it is used for service-to-service authentication and encryption. Then discuss how TLS handshake overhead impacts CPU and latency as scale grows. Mention certificate management complexity. Finally, propose concrete scaling solutions like session resumption, connection pooling, and automated certificate management. Use numbers to justify bottlenecks and solutions.
Your database handles 1000 QPS. Traffic grows 10x. What do you do first?
Answer: The database is the bottleneck at 1000 QPS, so first add read replicas and caching to reduce its load before scaling anything else. The same reasoning applies to mutual TLS: if CPU is the bottleneck because of TLS handshakes, first implement TLS session resumption and connection reuse to cut handshake CPU, then scale out.
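To put a number on that answer, the sketch below models handshake CPU as a function of the session resumption hit rate. The function name is hypothetical, and the 10 ms full-handshake and 1 ms resumed-handshake costs are assumed placeholder values, not measurements.

```python
def handshake_cpu_cores(handshakes_per_sec: float,
                        resumption_hit_rate: float,
                        full_ms: float,
                        resumed_ms: float) -> float:
    """Cores consumed by handshakes when a fraction of them are resumed."""
    if not 0.0 <= resumption_hit_rate <= 1.0:
        raise ValueError("resumption_hit_rate must be between 0 and 1")
    # Weighted average CPU cost of one handshake, in milliseconds.
    avg_ms = ((1.0 - resumption_hit_rate) * full_ms
              + resumption_hit_rate * resumed_ms)
    return handshakes_per_sec * avg_ms / 1000.0


# Assumed costs: 10 ms per full handshake, 1 ms per resumed handshake.
before = handshake_cpu_cores(100_000, 0.0, full_ms=10, resumed_ms=1)
after = handshake_cpu_cores(100_000, 0.9, full_ms=10, resumed_ms=1)
print(f"{before:.0f} cores -> {after:.0f} cores")  # 1000 cores -> 190 cores
```

Under these assumptions a 90% resumption hit rate removes about four fifths of the handshake CPU without adding a single server, which is the argument for doing it before scaling horizontally.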