How to Handle Failures in Microservices: Best Practices
To handle failures in microservices, use retry mechanisms, circuit breakers, and fallback methods to prevent cascading errors and improve system resilience. Implementing timeouts and bulkheads also helps isolate failures and maintain service availability.

Why This Happens
Failures in microservices happen because services depend on each other over a network, which can be slow or unreliable. If one service is down or slow, its callers can hang or fail in turn, producing a chain reaction. Without proper handling, a single failure can bring down the whole system.
```javascript
// A naive call to a downstream service: no error handling or timeout,
// so a slow or failed service-b hangs the caller indefinitely.
async function callService() {
  const response = await fetch('http://service-b/api/data');
  const data = await response.json();
  return data;
}
```
The Fix
Fix this by adding retries with delays, timeouts so requests don't wait forever, and a circuit breaker that temporarily stops calling a failing service. Also provide fallback data or behavior for when the service is down.
```javascript
import fetch from 'node-fetch';

// Simple circuit breaker: after `threshold` consecutive failures,
// reject calls immediately for `resetTimeout` ms before trying again.
class CircuitBreaker {
  constructor() {
    this.failures = 0;
    this.threshold = 3;       // consecutive failures before opening
    this.open = false;
    this.resetTimeout = 5000; // ms to stay open before closing again
  }

  async call(fn) {
    if (this.open) {
      throw new Error('Circuit breaker is open');
    }
    try {
      const result = await fn();
      this.failures = 0; // a success resets the failure count
      return result;
    } catch (e) {
      this.failures++;
      if (this.failures >= this.threshold) {
        this.open = true;
        // Close again after resetTimeout so calls are allowed through.
        setTimeout(() => {
          this.open = false;
          this.failures = 0;
        }, this.resetTimeout);
      }
      throw e;
    }
  }
}

// Reject if the request takes longer than `timeout` ms. Note that
// Promise.race only abandons the result; the underlying HTTP request
// is not cancelled (use AbortController if you need to abort it).
async function fetchWithTimeout(url, timeout = 3000) {
  return Promise.race([
    fetch(url),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Timeout')), timeout)
    ),
  ]);
}

const breaker = new CircuitBreaker();

async function callService() {
  try {
    const response = await breaker.call(() =>
      fetchWithTimeout('http://service-b/api/data')
    );
    if (!response.ok) throw new Error('Bad response');
    return await response.json();
  } catch (e) {
    console.log('Service failed, returning fallback');
    return { data: 'fallback data' }; // graceful degradation
  }
}
```
Prevention
Prevent failures by designing microservices to be resilient: use retries with exponential backoff, circuit breakers to stop repeated calls to failing services, and fallback methods to provide default responses. Also, isolate services with bulkheads so one failure doesn't affect others. Monitor services and set alerts to fix issues early.
| Best Practice | Description |
|---|---|
| Retries with Backoff | Retry failed requests with increasing delay to avoid overload. |
| Circuit Breaker | Stop calling a failing service temporarily to prevent cascading failures. |
| Fallbacks | Provide default data or behavior when a service is down. |
| Timeouts | Limit wait time for service responses to avoid hanging. |
| Bulkheads | Isolate resources so failures don't spread across services. |
| Monitoring & Alerts | Track service health and get notified of failures quickly. |
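The fix above does not include the retry-with-backoff practice from the table, so here is a minimal sketch of one. The helper name `retryWithBackoff` and the default attempt count and delays are illustrative choices, not a prescribed API:

```javascript
// Retry an async operation, doubling the delay after each failure
// (exponential backoff) so a struggling service isn't hammered.
async function retryWithBackoff(fn, attempts = 3, baseDelayMs = 100) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (e) {
      lastError = e;
      if (i < attempts - 1) {
        // Delays grow as baseDelayMs, 2*baseDelayMs, 4*baseDelayMs, ...
        const delay = baseDelayMs * 2 ** i;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError; // all attempts failed
}
```

You would wrap the circuit-breaker call from the fix, e.g. `retryWithBackoff(() => breaker.call(...))`, so transient failures are retried while a persistent outage still trips the breaker.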
Related Errors
Common related errors include timeout errors when a service takes too long to respond, connection-refused errors when a service is down, and cascading failures where one failure triggers many others. Address these with the timeouts, circuit breakers, and fallback strategies shown above.
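As a rough sketch, these error types can be distinguished so each gets the right strategy. The error shapes (`e.code === 'ECONNREFUSED'` follows Node.js conventions; the `'Timeout'` message matches the fetchWithTimeout sketch above) and the returned action labels are illustrative assumptions, not a standard API:

```javascript
// Map common failure modes to a handling decision.
// The action strings are illustrative labels for this sketch.
function classifyFailure(e) {
  if (e.name === 'AbortError' || e.message === 'Timeout') {
    return 'retry-with-timeout'; // slow service: retry with a bounded wait
  }
  if (e.code === 'ECONNREFUSED') {
    return 'use-fallback';       // service is down: serve default data
  }
  return 'open-circuit';         // repeated unknown failures: trip the breaker
}
```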