
Retry with exponential backoff in Microservices - System Design Guide

Problem Statement
When a microservice call fails because of a temporary problem such as a network glitch or a rate limit, retrying immediately and repeatedly can overload the system and cause cascading failures. The result is longer outages and a poor user experience.
Solution
Retry with exponential backoff gradually increases the wait time between retries, starting with a short delay and doubling it after each failure. This reduces the retry rate under failure conditions, giving the system time to recover and preventing overload.
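To make the doubling concrete, here is a minimal sketch that lists the wait before each retry. The function name `backoff_delays` and the parameter values (0.5 s initial delay, 8 s cap, six retries) are assumptions chosen for illustration, not values prescribed by any library.

```python
def backoff_delays(initial_delay, max_delay, max_retries):
    """Return the wait in seconds before each retry: initial, 2x, 4x, ... capped at max_delay."""
    return [min(initial_delay * (2 ** attempt), max_delay)
            for attempt in range(max_retries)]


# Illustrative values only: 0.5 s initial delay, 8 s cap, six retries.
print(backoff_delays(0.5, 8.0, 6))  # [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
```

The delay doubles until it reaches the cap and then stays there, so a long outage does not produce unboundedly long waits.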
Architecture
[Diagram: Client → Service A → Retry Logic with Exponential Backoff → Service B]

This diagram shows a client calling Service A, which calls Service B. If Service B fails, the retry logic with exponential backoff controls the timing of retries to avoid overwhelming Service B.

Trade-offs
✓ Pros
Reduces load on failing services by spacing out retries.
Improves system stability during transient failures.
Prevents cascading failures in distributed systems.
Simple to implement and widely supported.
✗ Cons
Increases latency for requests that succeed only after one or more retries.
Requires careful tuning of initial delay, max delay, and max retries.
Does not solve permanent failures; may delay error detection.
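One way to approach the tuning is to compute the worst-case time a caller spends waiting and compare it with the latency budget. The sketch below assumes the same scheme as this guide's code example, where no sleep follows the final attempt; the function name `worst_case_wait` is hypothetical.

```python
def worst_case_wait(initial_delay, max_delay, max_attempts):
    """Total seconds spent sleeping if every attempt fails.

    Assumes there is no sleep after the last attempt, so there are
    max_attempts - 1 waits. Time spent inside the calls themselves is not counted.
    """
    return sum(min(initial_delay * (2 ** attempt), max_delay)
               for attempt in range(max_attempts - 1))


# With a 0.5 s initial delay, an 8 s cap, and 5 attempts:
# 0.5 + 1 + 2 + 4 = 7.5 seconds of waiting before giving up.
print(worst_case_wait(0.5, 8.0, 5))  # 7.5
```

If 7.5 seconds is more than a caller can tolerate, lowering the attempt count usually shrinks the total more than lowering the initial delay, because the last waits are the longest.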
Use when your system calls external or internal services that may fail temporarily, especially under load or network issues. Suitable for systems with retryable transient errors and where preventing overload is critical.
Avoid when failures are permanent or non-retryable, or when low latency is critical and retries would cause unacceptable delays.
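The distinction between retryable and non-retryable failures can be built into the retry loop itself. The sketch below is one possible shape, not a standard API: `TransientError`, `PermanentError`, and `call_with_backoff` are hypothetical names, and a real service would map its own error types (for example timeouts or throttling responses) onto the retryable class.

```python
import time


class TransientError(Exception):
    """Hypothetical: a failure worth retrying, such as a timeout or throttling."""


class PermanentError(Exception):
    """Hypothetical: a failure retrying cannot fix, such as a malformed request."""


def call_with_backoff(operation, max_attempts=5, initial_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on TransientError only."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last transient error
            sleep(min(initial_delay * (2 ** attempt), max_delay))
        # PermanentError is not caught, so it reaches the caller immediately.
```

Because only `TransientError` is caught, a permanent failure is reported after a single call instead of after the full backoff sequence. The `sleep` parameter is injectable only so that the waits can be replaced in tests.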
Real World Examples
Amazon
Amazon uses exponential backoff in AWS SDKs to handle throttling errors from services like DynamoDB, reducing retry storms and improving client stability.
Netflix
Netflix applies exponential backoff in its microservices communication to handle transient network failures and rate limits, improving resilience and user experience.
Google
Google Cloud client libraries retry with exponential backoff, and Google's API documentation recommends it for handling quota limits and transient errors, which supports fair usage and system stability.
Code Example
The before code retries immediately on failure, causing potential overload. The after code waits longer between retries, doubling the delay each time up to a max, reducing retry frequency and giving the service time to recover.
import time
import random

# Before: naive retry without backoff

def call_service_naive():
    for _ in range(5):
        try:
            # simulate service call
            result = unreliable_service()
            return result
        except Exception:
            pass  # immediately retry
    raise Exception("Failed after retries")

# After: retry with exponential backoff

def call_service_backoff():
    max_retries = 5   # total attempts, including the first call
    delay = 0.5       # initial delay in seconds
    max_delay = 8     # upper bound on any single wait, in seconds
    for attempt in range(max_retries):
        try:
            return unreliable_service()
        except Exception:
            if attempt == max_retries - 1:
                break  # no attempts left, so skip the final sleep
            # Wait 0.5s, 1s, 2s, 4s, ... capped at max_delay
            sleep_time = min(delay * (2 ** attempt), max_delay)
            time.sleep(sleep_time)
    raise Exception("Failed after retries with backoff")

# Simulated unreliable service

def unreliable_service():
    if random.random() < 0.7:
        raise Exception("Temporary failure")
    return "Success"
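Output (from print(call_service_backoff()) when one of the attempts succeeds; if all five attempts fail, the function raises an exception instead)
Success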
Alternatives
Fixed Interval Retry
Retries happen after a constant fixed delay regardless of failure count.
Use when: system load is low and simplicity is preferred over adaptive retry timing.
Circuit Breaker
Stops sending calls to a service after a threshold of failures to prevent overload, then lets a trial request through to test service health before resuming.
Use when: you want to fail fast and avoid retrying during prolonged outages.
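The circuit breaker behaviour described above can be sketched in a few lines. This is a simplified, single-threaded illustration under assumed names (`CircuitBreaker`, `CircuitOpenError`) and assumed defaults (3 failures, 30 s reset timeout); production libraries add thread safety, metrics, and finer-grained states.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and let one trial call test the service.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately and the failing service receives no traffic at all, which is the key difference from backoff, where each client still sends its remaining retries.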
Jittered Exponential Backoff
Adds random variation (jitter) to the exponential backoff delay to prevent synchronized retries.
Use when: many clients retry simultaneously and you need to avoid retry storms.
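A common way to add jitter is the variant often called "full jitter": the exponential value becomes an upper bound, and the actual wait is drawn uniformly between zero and that bound. The sketch below assumes this variant; the function name `jittered_delay` and the default values are illustrative.

```python
import random


def jittered_delay(attempt, initial_delay=0.5, max_delay=8.0, rng=random):
    """Return a random wait in [0, cap], where cap is the capped exponential delay."""
    cap = min(initial_delay * (2 ** attempt), max_delay)
    return rng.uniform(0, cap)


# Each client draws a different wait, so retries spread out instead of arriving together.
for attempt in range(4):
    print(attempt, round(jittered_delay(attempt), 3))
```

Without jitter, clients that failed at the same moment all wait exactly 0.5 s, then 1 s, then 2 s, and hit the service again in lockstep. Randomizing the wait breaks that synchronization while keeping the same upper bound on delay.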
Summary
Retry with exponential backoff spaces out retries by increasing wait times after failures.
It prevents overload and cascading failures in distributed microservices.
Proper tuning, combined with other patterns such as circuit breakers, improves system resilience.