| Request Rate | Retry Behavior | Impact on System | Potential Issues |
|---|---|---|---|
| 100 requests/sec | Few retries, small delays (e.g., 100ms, 200ms) | Minimal extra load, retries rarely overlap | Retries usually succeed quickly, no overload |
| 10,000 requests/sec | More retries, delays grow exponentially (e.g., 100ms, 200ms, 400ms...) | Increased load spikes during retries, some congestion possible | Risk of retry storms if many fail simultaneously |
| 1,000,000 requests/sec | Many retries with longer delays, jitter added to spread retries | High load on services, possible cascading failures if not controlled | Retries can cause resource exhaustion, latency spikes |
| 100,000,000 requests/sec | Retries must be carefully throttled, circuit breakers used | System needs advanced controls to prevent overload | Without controls, retries cause system-wide outages |
# Retry with Exponential Backoff in Microservices: Scalability & System Analysis
The first bottleneck is the service receiving the retries. When many clients retry at the same time, retry traffic stacks on top of normal traffic, pushing request volume well beyond what the service was provisioned for and exhausting its CPU and memory. Common mitigations include:
- Exponential backoff with jitter: Add randomness to retry delays to avoid retry storms.
- Rate limiting retries: Limit how many retries a client can do per time unit.
- Circuit breakers: Temporarily stop retries when the service is unhealthy.
- Horizontal scaling: Add more service instances to handle increased load.
- Load balancing: Distribute retry requests evenly across instances.
- Caching and idempotency: Reduce load by caching responses and making retries safe.
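The first mitigation, exponential backoff with full jitter, can be sketched in a few lines. `request_fn`, the delay values, and the retry cap below are illustrative placeholders, not values from any specific library:

```python
import random
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=0.1, max_delay=10.0):
    """Call request_fn, retrying transient failures with exponential
    backoff and full jitter. request_fn stands in for any network call."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error to the caller
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ... capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Full jitter: sleep a random amount in [0, delay] so that
            # clients that failed together do not retry together
            time.sleep(random.uniform(0, delay))
```

The jitter is what prevents the retry storm: without it, every client that failed at time T retries at exactly T+100ms, T+300ms, and so on, recreating the original traffic spike.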
Assuming 10,000 requests/sec with 20% failure rate triggering retries:
- Initial requests: 10,000/sec
- Retries: 2,000/sec (20% of 10,000)
- With exponential backoff spreading retries over several seconds, the peak retry rate in any one second is assumed to fall to ~500/sec (this holds for transient failure bursts; if failures are sustained, the steady-state retry rate stays near 2,000/sec and backoff alone is not enough)
- Service must handle ~10,500 requests/sec peak
- Bandwidth and CPU must scale accordingly; add 5-10% overhead for retries
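The back-of-envelope numbers above can be verified in a few lines. The 25% spread factor (the fraction of the 2,000 retries/sec assumed to land in the peak second) is an illustrative assumption matching the ~500/sec figure:

```python
base_rate = 10_000       # incoming requests/sec
failure_rate = 0.20      # fraction of requests that fail and trigger a retry
spread_factor = 0.25     # assumed fraction of retries landing in the peak
                         # second after backoff spreads them over time

retries = base_rate * failure_rate        # 2,000 retries/sec generated
peak_retries = retries * spread_factor    # ~500/sec hitting the peak second
peak_load = base_rate + peak_retries      # ~10,500 requests/sec to provision for
overhead = peak_retries / base_rate       # ~5% extra capacity for retries
```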
When discussing retry with exponential backoff, start by explaining the problem retries solve. Then describe how exponential backoff reduces retry storms. Next, mention adding jitter and circuit breakers to improve stability. Finally, discuss scaling the service horizontally and rate limiting retries to handle growth.
Your database handles 1000 QPS. Traffic grows 10x to 10,000 QPS with retries increasing load further. What do you do first?
Answer: Implement exponential backoff with jitter and circuit breakers to reduce retry load, then horizontally scale the database with read replicas and connection pooling to handle increased QPS.
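The circuit breaker mentioned in the answer can be sketched as a small client-side wrapper. The threshold and cooldown values are illustrative, not tuned recommendations:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after repeated failures, reject requests
    locally instead of hammering an unhealthy service, then allow a single
    trial request once the cooldown elapses (half-open state)."""

    def __init__(self, failure_threshold=5, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, request_fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Open: fail fast without touching the downstream service
                raise RuntimeError("circuit open: request rejected")
            # Cooldown elapsed: go half-open and let one trial through
            self.opened_at = None
        try:
            result = request_fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the circuit
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Combined with backoff and jitter, this caps the retry load a struggling database sees: once the breaker trips, clients stop retrying entirely until the cooldown passes.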