Overview - Cluster failover

What is it?

Cluster failover is a process in Redis where if a master node stops working, one of its replica nodes automatically takes over as the new master. This keeps the database available without manual intervention. It helps Redis clusters stay reliable and responsive even when some parts fail.

Why it matters

Without cluster failover, if a master node fails, the whole part of the database it manages would become unavailable until someone fixes it. This causes downtime and can break applications relying on Redis. Failover ensures continuous service and data availability, which is critical for real-time apps like messaging or caching.

Where it fits

Before learning cluster failover, you should understand Redis basics, how Redis clusters work, and the roles of master and replica nodes. After this, you can explore advanced topics like cluster rebalancing, consistency models, and monitoring Redis clusters in production.

Mental Model

Core Idea

Cluster failover is the automatic switch from a failed master node to a replica node to keep the Redis cluster running smoothly.

Think of it like...

Imagine a relay race where if the runner carrying the baton trips, the next runner immediately picks up the baton and continues running without stopping the race.

┌─────────────┐      ┌─────────────┐
│ Master Node │─────▶│ Replica 1   │
│             │      └─────────────┘
│             │      ┌─────────────┐
│             │─────▶│ Replica 2   │
└─────────────┘      └─────────────┘
       │
       ▼
   [Failure]
       │
       ▼
┌──────────────────────────────┐
│ Failover triggers election   │
│ Replica 1 becomes new master │
└──────────────────────────────┘

Build-Up - 7 Steps

1

Foundation: Understanding Redis Cluster Basics

Concept: Learn what a Redis cluster is and the roles of master and replica nodes.

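A Redis cluster is a group of Redis servers working together to store data. The keyspace is divided into 16384 hash slots, and each master node owns a subset of those slots, so each master holds a part of the data. A master can have one or more replicas that copy its data. Masters handle writes and reads, while replicas act as backups that can take over if a master fails and can serve reads when a client opts in with the READONLY command.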

Result

You know the basic structure of a Redis cluster and the difference between master and replica nodes.

Understanding the cluster structure is essential because failover depends on the relationship between masters and replicas.
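
To make "each master holds a part of the data" concrete, the sketch below computes which hash slot a key belongs to, following the rule in the Redis Cluster specification: CRC16 of the key modulo 16384, hashing only the text inside the first non-empty {...} hash tag when one is present. The function name key_slot is illustrative and not part of any Redis client. The sketch assumes that Python's binascii.crc_hqx with an initial value of 0 is the same CRC16 variant (XMODEM) that the specification names.

```python
import binascii

HASH_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def key_slot(key: str) -> int:
    """Return the hash slot for a key (illustrative helper, not a Redis API)."""
    data = key.encode("utf-8")
    # Hash tags: if the key has a non-empty {...} section, only the text
    # between the first '{' and the next '}' is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # Assumption: crc_hqx with initial value 0 matches the spec's CRC16.
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


if __name__ == "__main__":
    for key in ("foo", "{user1000}.following", "{user1000}.followers"):
        print(key, "->", key_slot(key))
```

Keys that share a hash tag land in the same slot and therefore on the same master, which is also the master whose replica would take over those keys after a failover.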

2

Foundation: What Happens When a Node Fails

3

Intermediate: How Failover Election Works

4

Intermediate: Role of Configuration Epoch in Failover

5

Intermediate: Failover Timing and Detection

6

Advanced: Handling Failover in Large Clusters

7

Expert: Surprises in Failover: Split-Brain and Manual Intervention

Under the Hood

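Every node in a Redis Cluster exchanges periodic ping and pong messages with other nodes over the cluster bus. When a node gets no reply from a master within the node timeout, it flags that master as possibly failed (PFAIL) and shares the suspicion through gossip. Once a majority of masters report the same node as unreachable, it is marked as failed (FAIL). A replica of the failed master then increments the cluster's current epoch and asks the remaining masters for votes; each master grants at most one vote per epoch. The replica that collects votes from a majority of masters promotes itself, takes over the failed master's hash slots under a new configuration epoch, and broadcasts the change. Clients learn the new layout through MOVED redirections or by refreshing their slot map. Because replication is asynchronous, writes that had not yet reached the promoted replica can be lost.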
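
The voting step can be sketched as a small simulation. This is a simplified model under stated assumptions, not Redis's implementation: the Master class and the functions master_is_failed and run_election are hypothetical names, and the model ignores gossip delays, report expiry, and the other checks a real master performs before voting. It only shows the two majority rules: a node is treated as failed once more than half of the masters report it, and a replica wins once more than half of the masters vote for it, with one vote per master per epoch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Master:
    """A voting master in the simulation (hypothetical model class)."""
    name: str
    reachable: bool = True
    last_vote_epoch: int = 0
    voted_for: str | None = None

    def request_vote(self, replica: str, epoch: int) -> bool:
        # A master votes at most once per epoch and never for an older epoch.
        if not self.reachable or epoch <= self.last_vote_epoch:
            return False
        self.last_vote_epoch = epoch
        self.voted_for = replica
        return True


def master_is_failed(failure_reports: int, total_masters: int) -> bool:
    """True once a strict majority of masters report the node as down."""
    return failure_reports > total_masters // 2


def run_election(replica: str, epoch: int, voters: list[Master],
                 total_masters: int) -> bool:
    """True if the replica collects votes from a strict majority of masters."""
    votes = sum(1 for m in voters if m.request_vote(replica, epoch))
    return votes > total_masters // 2


if __name__ == "__main__":
    # Three-master cluster: master "A" has failed, "B" and "C" survive.
    voters = [Master("B"), Master("C")]
    print(master_is_failed(failure_reports=2, total_masters=3))
    print(run_election("A-replica-1", epoch=6, voters=voters, total_masters=3))
    # A second replica asking in the same epoch gets no votes.
    print(run_election("A-replica-2", epoch=6, voters=voters, total_masters=3))
```

Note that the majority is counted against all masters, including the failed one, which is why a cluster that loses most of its masters at once cannot elect replacements on its own.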

Why designed this way?

Redis failover was designed to provide high availability with minimal downtime and no single point of failure. Requiring votes from a majority of masters, with at most one vote per master per epoch, means two replicas cannot both win the same election, which limits the split-brain scenarios common in distributed systems. A decentralized design avoids both the delay of waiting for an operator and the single point of failure of a central coordinator. The approach balances speed, reliability, and simplicity.

┌───────────────┐       ┌────────────────┐
│ Master Node   │──────▶│ Replica Node 1 │
│ (Fails)       │       └────────────────┘
│               │       ┌────────────────┐
│               │──────▶│ Replica Node 2 │
└───────────────┘       └────────────────┘
        │
        ▼
┌────────────────────────────────┐
│ Majority of masters flag FAIL  │
│ Replicas request votes         │
└────────────────────────────────┘
        ▼
┌────────────────────────────────┐
│ Replica with majority of votes │
│ becomes new master             │
└────────────────────────────────┘
        ▼
┌────────────────────────────────┐
│ Cluster state updated          │
│ Clients redirected             │
└────────────────────────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Does failover guarantee zero data loss? Commit to yes or no before reading on.

Common Belief: Failover always prevents any data loss because replicas have exact copies.

Quick: Do you think failover happens instantly the moment a master is unreachable? Commit to yes or no before reading on.

Common Belief: Failover triggers immediately as soon as a master stops responding.

Quick: Can a replica become master without any voting? Commit to yes or no before reading on.

Common Belief: A replica can promote itself to master as soon as it detects master failure.

Quick: Is manual intervention never needed in failover? Commit to yes or no before reading on.

Common Belief: Automatic failover handles all failure cases without human help.

Expert Zone

1

Failover timing parameters must be carefully tuned to balance fast recovery and false failover prevention, which varies by network conditions.

2

Configuration epochs are critical for cluster consistency but can cause confusion during manual cluster repairs if misunderstood.

3

Gossip protocol delays can affect failover speed in large or geographically distributed clusters, requiring monitoring and tuning.
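
One timing detail worth seeing in numbers is the pause a replica takes before it asks for votes. The Redis Cluster specification gives this delay as 500 ms, plus a random 0 to 500 ms, plus 1000 ms per rank, where a replica's rank is the number of sibling replicas with a more up-to-date replication offset. The sketch below reproduces that formula; the function names and the sample offsets are illustrative assumptions, and the total failover time in a real cluster also includes the node timeout and gossip propagation, which are not modeled here.

```python
from __future__ import annotations

import random


def replica_ranks(offsets: dict[str, int]) -> dict[str, int]:
    """Rank each replica by how many siblings have a higher replication offset."""
    return {
        name: sum(1 for other in offsets.values() if other > offset)
        for name, offset in offsets.items()
    }


def election_delay_ms(rank: int, rng: random.Random | None = None) -> int:
    """Delay before requesting votes: 500 ms + random 0..500 ms + rank * 1000 ms."""
    if rng is None:
        rng = random.Random()
    return 500 + rng.randint(0, 500) + rank * 1000


if __name__ == "__main__":
    # Hypothetical replication offsets for three replicas of one master.
    offsets = {"replica-1": 1200, "replica-2": 1500, "replica-3": 900}
    for name, rank in replica_ranks(offsets).items():
        print(name, "rank", rank, "delay", election_delay_ms(rank), "ms")
```

The rank term gives the most up-to-date replica a head start, so it usually wins the election, although the outcome is still decided by the masters' votes.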

When NOT to use

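Cluster failover is not suitable when absolute zero data loss is required. Redis replicates asynchronously, so a promoted replica may be missing the most recent writes; the WAIT command narrows that window but does not close it. In such cases, use a system with synchronous replication or an external consensus system like ZooKeeper. Also, for very small or single-node Redis setups, failover is unnecessary and adds complexity.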

Production Patterns

In production, Redis failover is combined with monitoring tools that alert on failover events, automated scripts to handle manual recovery steps, and careful parameter tuning. Operators typically rely on Redis Cluster's built-in failover for sharded deployments and on Redis Sentinel for non-clustered master-replica deployments, and they integrate failover events with orchestration systems for smooth scaling and maintenance.

Connections

Distributed Consensus Algorithms

Cluster failover uses voting and leader election similar to consensus algorithms like Raft or Paxos.

Understanding consensus algorithms helps grasp how Redis prevents split-brain and ensures a single master during failover.

High Availability in Cloud Systems

Failover is a key technique in cloud systems to maintain service availability during failures.

Knowing failover in Redis connects to broader cloud strategies for fault tolerance and uptime.

Emergency Backup Generators

Failover is like switching to a backup power generator when the main power fails.

This cross-domain connection highlights the importance of automatic backup systems to maintain continuous operation.

Common Pitfalls

#1 Failover triggers too quickly, causing unnecessary master switches.

Wrong approach: Setting the node timeout (cluster-node-timeout) to a very low value such as 100 ms, so that a brief network hiccup is treated as a failure.

Correct approach: Set cluster-node-timeout to a balanced value such as 5000 ms (5 seconds) so that a failure is confirmed before switching. The default is 15000 ms.

Root cause: Misunderstanding failover detection timing leads to instability and frequent failovers.

#2 Assuming replicas always have up-to-date data before failover.

Wrong approach: Ignoring replication lag and letting a replica that is far behind the master be promoted.

Correct approach: Monitor replication lag, and use cluster-replica-validity-factor so that a replica disconnected from its master for too long does not attempt a failover.

Root cause: Overlooking replication delays causes data loss or stale data serving after failover.
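
The staleness check behind this pitfall can be written out as arithmetic. According to the comments in redis.conf, a replica does not try to fail over if it has been disconnected from its master for longer than (node timeout * cluster-replica-validity-factor) + repl-ping-replica-period, and a factor of 0 disables the check. The sketch below encodes that rule in seconds; the function names are illustrative and the default of 10 seconds for the ping period is an assumption taken from the same example in redis.conf.

```python
def max_disconnect_seconds(node_timeout_s: float, validity_factor: int,
                           repl_ping_period_s: float = 10.0) -> float:
    """Longest master-link outage after which a replica still tries to fail over."""
    return node_timeout_s * validity_factor + repl_ping_period_s


def replica_may_fail_over(disconnected_s: float, node_timeout_s: float,
                          validity_factor: int,
                          repl_ping_period_s: float = 10.0) -> bool:
    """Apply the validity-factor rule; a factor of 0 means always eligible."""
    if validity_factor == 0:
        return True
    limit = max_disconnect_seconds(node_timeout_s, validity_factor,
                                   repl_ping_period_s)
    return disconnected_s <= limit


if __name__ == "__main__":
    # Example from redis.conf: 30 s timeout, factor 10, 10 s ping period.
    print(max_disconnect_seconds(30, 10))       # 310.0
    print(replica_may_fail_over(400, 30, 10))   # False
```

A larger factor favors availability, since older replicas stay eligible, while a smaller factor favors data freshness at the risk of leaving a shard without a master.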

#3 Manually promoting a replica to master without cluster coordination.

Wrong approach: Forcing a replica to act as a master with commands that bypass voting and leave the rest of the cluster with a stale view of who owns the slots.

Correct approach: Run the CLUSTER FAILOVER command on the replica you want to promote so that the cluster coordinates the switch and updates its configuration. In non-clustered deployments, let Redis Sentinel coordinate the promotion.

Root cause: Ignoring cluster coordination causes split-brain and inconsistent cluster state.
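
A coordinated manual failover is requested with the CLUSTER FAILOVER command, sent to the replica that should take over. The helper below is a minimal sketch under assumptions: manual_failover is a hypothetical function, and it expects any client object that exposes an execute_command method, as redis-py's redis.Redis class does, connected directly to that replica. The demo at the bottom uses a stand-in client so the sketch runs without a server.

```python
from __future__ import annotations

ALLOWED_MODES = (None, "FORCE", "TAKEOVER")


def manual_failover(replica_client, mode: str | None = None):
    """Ask the replica behind `replica_client` to take over from its master.

    With no mode, the replica coordinates the handoff with its master.
    FORCE and TAKEOVER are for cases where the master is unreachable;
    TAKEOVER also skips the vote, so it should be a last resort.
    """
    normalized = mode.upper() if mode is not None else None
    if normalized not in ALLOWED_MODES:
        raise ValueError(f"unsupported CLUSTER FAILOVER mode: {mode!r}")
    args = ["CLUSTER FAILOVER"]
    if normalized is not None:
        args.append(normalized)
    return replica_client.execute_command(*args)


class RecordingClient:
    """Stand-in for a real client; records commands instead of sending them."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def execute_command(self, *args):
        self.calls.append(args)
        return True


if __name__ == "__main__":
    client = RecordingClient()
    manual_failover(client)
    manual_failover(client, "force")
    print(client.calls)
```

The plain form is the usual choice for planned maintenance, because the old master stops accepting writes and lets the replica catch up before the roles are swapped.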

Key Takeaways

Cluster failover in Redis automatically switches to a replica when a master fails to keep the database available.

Failover uses a voting process and configuration epochs to elect a new master and prevent conflicts.

Detection of failure is carefully timed to avoid false failovers and ensure cluster stability.

Despite automation, failover can require manual intervention in rare network partition cases.

Understanding failover deeply helps tune Redis clusters for reliability and prepares you for real-world challenges.