
Graceful degradation in Microservices - Scalability & System Analysis

Scalability Analysis - Graceful degradation
Growth Table: Graceful Degradation in Microservices
| Users | Traffic Characteristics | System Behavior | Degradation Strategy |
|---|---|---|---|
| 100 | Low requests, low concurrency | All services fully operational | No degradation needed |
| 10,000 | Moderate requests, some spikes | Minor latency in non-critical services | Disable non-essential features temporarily |
| 1,000,000 | High requests, frequent spikes | Some services slow or partially unavailable | Fall back to cached data, limit feature set, circuit breakers active |
| 100,000,000 | Very high sustained traffic | Critical services prioritized, degraded UI, partial data | Full graceful degradation: disable heavy features, serve static content, queue requests |
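
The progression in the table can be sketched as a mapping from load to a degradation level, and from that level to the set of features left enabled. This is only an illustration: the utilization thresholds, the level names, and the feature names (`checkout`, `search`, `recommendations`) are assumptions, not values from the table.

```python
from enum import IntEnum


class Level(IntEnum):
    NORMAL = 0      # everything on
    REDUCED = 1     # non-essential features off
    ESSENTIAL = 2   # only critical features stay on


# Hypothetical thresholds on utilization (current load / provisioned capacity),
# checked from the most severe level down.
THRESHOLDS = [(0.95, Level.ESSENTIAL), (0.80, Level.REDUCED)]

# Hypothetical features, each with the highest level at which it stays enabled.
FEATURES = {
    "checkout": Level.ESSENTIAL,      # critical: never disabled
    "search": Level.REDUCED,          # useful: dropped only at the last level
    "recommendations": Level.NORMAL,  # non-essential: dropped first
}


def degradation_level(utilization: float) -> Level:
    """Pick the most severe level whose threshold has been reached."""
    for threshold, level in THRESHOLDS:
        if utilization >= threshold:
            return level
    return Level.NORMAL


def enabled_features(level: Level) -> set:
    """Return the features that remain on at the given level."""
    return {name for name, max_level in FEATURES.items() if level <= max_level}


if __name__ == "__main__":
    for load in (0.50, 0.90, 0.97):
        level = degradation_level(load)
        print(load, level.name, sorted(enabled_features(level)))
```

In a real system the level would more likely be driven by measured latency or error rates than by a single utilization number, and the feature map would live in a feature-flag service.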
First Bottleneck

In microservices, the first bottleneck under high load is usually a downstream dependency: another service or a database. When a service waits on a slow or overloaded dependency, its own threads and connections stay occupied, so the delay cascades upstream as higher latency and timeouts.

Network congestion and CPU saturation on critical services also appear early as bottlenecks.
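
One common way to keep a slow dependency from stalling its callers is to bound every downstream call with a timeout and return a fallback when the bound is exceeded. The sketch below assumes a thread-pool based caller; `call_with_timeout` and `slow_dependency` are hypothetical names used only for illustration.

```python
import concurrent.futures
import time

# Shared pool for downstream calls; the size is an arbitrary illustrative value.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in the pool; return fallback if it is too slow or raises."""
    future = _executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: a call that is already running is not interrupted.
        future.cancel()
        return fallback
    except Exception:
        # The dependency failed outright; degrade instead of propagating.
        return fallback


def slow_dependency():
    """Stand-in for an overloaded downstream service."""
    time.sleep(0.3)
    return "fresh data"


if __name__ == "__main__":
    print(call_with_timeout(slow_dependency, 0.05, "cached data"))
    print(call_with_timeout(lambda: "fresh data", 1.0, "cached data"))
```

Note that the timed-out call keeps running in its worker thread, so a timeout limits the caller's latency but does not by itself reduce load on the dependency.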

Scaling Solutions for Graceful Degradation
  • Circuit Breakers: Automatically stop calls to failing services to prevent cascading failures.
  • Fallbacks: Serve cached or default data when a service is slow or down.
  • Feature Flags: Disable non-critical features dynamically to reduce load.
  • Load Shedding: Reject or delay low priority requests during overload.
  • Horizontal Scaling: Add more instances of critical services to handle load.
  • Asynchronous Processing: Queue requests for heavy operations to smooth spikes.
  • CDN and Caching: Offload static content and cache responses to reduce backend load.
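
The circuit breaker and fallback items above can be combined in a small sketch. This is a minimal, single-threaded illustration under assumed parameters (`failure_threshold`, `reset_timeout_s`), not the API of any particular library; production breakers usually add locking, metrics, and a more careful half-open state.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open trial -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout_s = reset_timeout_s      # how long to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Open: skip the failing service entirely and degrade.
                return fallback()
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open (or re-open after a failed trial) and restart the timer.
                self.opened_at = self.clock()
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


if __name__ == "__main__":
    def failing_service():
        raise ConnectionError("service unavailable")

    breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=5.0)
    for _ in range(3):
        print(breaker.call(failing_service, lambda: "cached response"))
```

Once the breaker is open, the third call returns the fallback without touching the failing service, which is what prevents the cascade.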
Back-of-Envelope Cost Analysis

Assuming 1 million users with 1 request per second each:

  • Requests per second: 1,000,000 QPS total.
  • Single service capacity: One instance handles ~5,000 QPS.
  • Instances needed: ~200 instances for critical services.
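  • Database load: assuming ~10,000 QPS max per database instance, sending all 1,000,000 QPS to the database would take ~100 instances, so use read replicas and caching to cut direct load.
  • Network bandwidth: 1 Gbps = 125 MB/s; multiply the average request size by the QPS to get the total bandwidth needed.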
  • Storage: Cache storage for fallback data; size depends on data freshness and volume.
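
The estimate above can be reproduced with a few lines of arithmetic. The user count, per-instance capacities, and link speed come from the list; the 90% cache hit ratio and the 1 KB average request size are assumptions added purely to make the database and bandwidth figures concrete.

```python
import math

# Figures taken from the estimate above.
users = 1_000_000
requests_per_user_per_s = 1
service_capacity_qps = 5_000     # per service instance
db_capacity_qps = 10_000         # per database instance
link_bytes_per_s = 125_000_000   # 1 Gbps = 125 MB/s

# Assumptions for illustration only.
cache_hit_percent = 90           # share of requests answered from cache
avg_request_bytes = 1_000        # roughly 1 KB per request

total_qps = users * requests_per_user_per_s
service_instances = math.ceil(total_qps / service_capacity_qps)

# Only cache misses reach the database.
db_qps = total_qps * (100 - cache_hit_percent) // 100
db_instances = math.ceil(db_qps / db_capacity_qps)

bytes_per_s = total_qps * avg_request_bytes
links_needed = math.ceil(bytes_per_s / link_bytes_per_s)
gbps_needed = bytes_per_s * 8 / 1e9

if __name__ == "__main__":
    print(f"total QPS:          {total_qps:,}")
    print(f"service instances:  {service_instances}")
    print(f"database QPS:       {db_qps:,} -> {db_instances} instances")
    print(f"bandwidth:          {gbps_needed:.0f} Gbps ({links_needed} x 1 Gbps links)")
```

Changing the assumed hit ratio shows why caching dominates the database bill: at a 0% hit ratio the same formula asks for 100 database instances instead of 10.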
Interview Tip

When discussing graceful degradation, start by identifying critical vs non-critical services. Explain how to detect overload and failures early. Describe fallback mechanisms and circuit breakers clearly. Show understanding of user experience impact and how to prioritize features. Discuss trade-offs between availability and consistency. Use real examples like disabling image loading or showing cached data.

Self-Check Question

Your database handles 1,000 QPS. Traffic grows 10x to 10,000 QPS. What do you do first?

Answer: Add read replicas and implement caching to reduce direct database load (most effective when the traffic is read-heavy) before scaling vertically or horizontally.
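
The caching half of that answer is often implemented as a cache-aside read with a time-to-live. The sketch below is an in-process illustration only; `TTLCache`, `get_or_load`, and the loader function are hypothetical names, and a real deployment would more likely use a shared cache in front of the read replicas.

```python
import time


class TTLCache:
    """Cache-aside helper: serve from memory until the entry expires."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock      # injectable for testing
        self._entries = {}       # key -> (value, expires_at)

    def get_or_load(self, key, load):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]      # cache hit: no database query
        value = load(key)        # cache miss: query the database once
        self._entries[key] = (value, now + self._ttl_s)
        return value


if __name__ == "__main__":
    db_reads = 0

    def load_from_db(key):
        """Stand-in for a database read."""
        global db_reads
        db_reads += 1
        return f"row for {key}"

    cache = TTLCache(ttl_s=60.0)
    for _ in range(1_000):
        cache.get_or_load("user:42", load_from_db)
    print("database reads:", db_reads)
```

The time-to-live is the trade-off knob: a longer value removes more database load but serves staler data.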

Key Result
Graceful degradation helps microservices maintain core functionality under heavy load by disabling or limiting non-critical features, using fallbacks, and preventing cascading failures with circuit breakers.

Practice

1. What is the main goal of graceful degradation in microservices?
easy
A. To increase the number of microservices for better scaling
B. To immediately stop all services when one fails
C. To keep the system running with reduced functionality during failures
D. To replace microservices with a monolithic architecture

Solution

  1. Step 1: Understand the concept of graceful degradation

    Graceful degradation means the system continues to work even if some parts fail, but with limited features.
  2. Step 2: Identify the goal in microservices context

    In microservices, it ensures users still get a response, possibly a simpler or fallback one, instead of a total failure.
  3. Final Answer:

    To keep the system running with reduced functionality during failures -> Option C
  4. Quick Check:

    Graceful degradation = reduced functionality during failure [OK]
Hint: Graceful degradation means partial working, not full stop [OK]
Common Mistakes:
  • Thinking graceful degradation means full system shutdown
  • Confusing graceful degradation with scaling techniques
  • Assuming it replaces microservices with monolith
2. Which of the following is a correct way to implement graceful degradation in a microservice call?
easy
A. Restart the entire microservice cluster immediately
B. Return an error and stop the entire request flow
C. Ignore the failure and return no response
D. Use a fallback response when the called service is unavailable

Solution

  1. Step 1: Identify how graceful degradation handles failures

    It uses fallback responses or simpler data to keep the system responsive.
  2. Step 2: Match the option that uses fallback

    Option D, "Use a fallback response when the called service is unavailable", is the only choice that keeps the request flowing with a fallback when a service is down, so it is correct.
  3. Final Answer:

    Use a fallback response when the called service is unavailable -> Option D
  4. Quick Check:

    Fallback response = graceful degradation [OK]
Hint: Fallback response is key to graceful degradation [OK]
Common Mistakes:
  • Stopping entire request instead of fallback
  • Ignoring failure without response
  • Restarting cluster is not graceful degradation
3. Consider this pseudocode for a microservice call with graceful degradation:
response = callService()
if response == null:
    response = getCachedData()
return response

What will be returned if callService() fails?
medium
A. Cached data as fallback
B. Null value
C. An error message
D. Empty string

Solution

  1. Step 1: Analyze the code flow when callService() fails

    If callService() returns null (failure), the code fetches cached data as fallback.
  2. Step 2: Determine the returned value

    The fallback cached data is returned instead of null or error.
  3. Final Answer:

    Cached data as fallback -> Option A
  4. Quick Check:

    Fallback cached data returned on failure [OK]
Hint: Null response triggers fallback to cached data [OK]
Common Mistakes:
  • Assuming error message is returned
  • Thinking null is returned directly
  • Confusing empty string with fallback data
4. A microservice uses this code snippet for graceful degradation:
try {
  data = fetchFromService()
} catch (Exception e) {
  data = null
}
return data.toString()

What is the main problem with this code?
medium
A. It does not handle exceptions properly
B. It returns null.toString() causing a runtime error
C. It always returns an empty string
D. It retries the service call infinitely

Solution

  1. Step 1: Understand exception handling and return statement

    If fetchFromService() fails, data is set to null, then data.toString() is called.
  2. Step 2: Identify the error caused by calling toString() on null

    Calling toString() on null causes a runtime NullPointerException or similar error.
  3. Final Answer:

    It returns null.toString() causing a runtime error -> Option B
  4. Quick Check:

    Calling toString() on null causes error [OK]
Hint: Calling method on null causes runtime error [OK]
Common Mistakes:
  • Ignoring null check before toString()
  • Assuming exception is handled fully
  • Thinking it retries infinitely
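
One possible fix for the snippet in this question is to check for the missing value and substitute a default before using it. The original snippet is Java-like; the sketch below shows the same idea in Python, where calling a method on `None` raises `AttributeError` rather than a `NullPointerException`. `fetch_from_service` and `DEFAULT_DATA` are hypothetical stand-ins.

```python
DEFAULT_DATA = "unavailable"  # hypothetical default shown when the service is down


def fetch_from_service():
    """Stand-in for the remote call; here it always fails."""
    raise ConnectionError("service unavailable")


def get_data(fetch=fetch_from_service, default=DEFAULT_DATA):
    try:
        data = fetch()
    except Exception:
        data = None
    # Guard before use: degrade to the default instead of failing on None.
    if data is None:
        return default
    return str(data)


if __name__ == "__main__":
    print(get_data())                   # service down: default is returned
    print(get_data(fetch=lambda: 42))   # service up: real value is returned
```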
5. You design a microservice system where the payment service may fail. To apply graceful degradation, which approach is best?
hard
A. Return a simplified confirmation without payment details and log failure for retry
B. Block the entire order process until payment service recovers
C. Send an error response to the user immediately without fallback
D. Remove the payment service and process orders without payment

Solution

  1. Step 1: Understand graceful degradation for critical service failure

    When the payment service fails, the system should still respond with limited information rather than block the order or return a bare error.
  2. Step 2: Evaluate options for best graceful degradation

    Option A returns a simplified confirmation and logs the failure for retry, which maintains the user experience and system reliability.
  3. Final Answer:

    Return a simplified confirmation without payment details and log failure for retry -> Option A
  4. Quick Check:

    Simplified response + retry = graceful degradation [OK]
Hint: Simplify response and log failure for retry [OK]
Common Mistakes:
  • Blocking entire process on failure
  • Sending immediate error without fallback
  • Removing critical service entirely
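
The approach in option A could look roughly like the sketch below: the order is accepted in a pending state, the user gets a simplified confirmation, and the failed charge is queued for a later retry. The `Order` fields, status strings, and in-memory `retry_queue` are assumptions for illustration; a real system would use a durable queue and idempotent payment requests.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    amount_cents: int
    status: str = "NEW"


# Hypothetical in-memory stand-in for a durable retry queue.
retry_queue = deque()


def place_order(order, charge):
    """Try to charge; on failure accept the order as pending and queue a retry."""
    try:
        charge(order)
    except Exception:
        order.status = "PENDING_PAYMENT"
        retry_queue.append(order)  # record the failure for retry
        # Simplified confirmation: no payment details are included.
        return {"order_id": order.order_id, "status": order.status}
    order.status = "PAID"
    return {"order_id": order.order_id, "status": order.status}


def retry_pending(charge):
    """Retry each queued order once; return how many charges succeeded."""
    succeeded = 0
    for _ in range(len(retry_queue)):
        order = retry_queue.popleft()
        try:
            charge(order)
        except Exception:
            retry_queue.append(order)  # still failing: keep it queued
        else:
            order.status = "PAID"
            succeeded += 1
    return succeeded


if __name__ == "__main__":
    def payment_down(order):
        raise ConnectionError("payment service unavailable")

    print(place_order(Order("o-1", 2_500), payment_down))
    print("recovered:", retry_pending(lambda order: None))
```

The key design point is that the order is never reported as paid until a charge actually succeeds; the user only sees that the order was received.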