Overview - Failover manual process

What is it?

Manual failover in Redis is the step-by-step procedure an operator follows to switch from a failed primary server to a replica by hand. It keeps the system working when the primary stops responding: the operator promotes a replica to become the new primary and redirects clients to it. No automatic tool makes the decision; every step requires human intervention.

Why it matters

Without failover, a crashed primary can stop every application that depends on it, causing downtime and loss of access to data. Manual failover allows recovery by switching to a replica that already holds a copy of the data. It matters most for systems that cannot afford long interruptions and need reliable data availability.

Where it fits

Before learning manual failover, you should understand Redis basics like primary and replica roles, and how data replication works. After mastering manual failover, you can explore automatic failover tools like Redis Sentinel or Redis Cluster for more advanced, hands-off recovery.

Mental Model

Core Idea

Manual failover is the human-controlled switch from a failed Redis primary server to a replica to keep data available and services running.

Think of it like...

It's like having a backup generator at home that you turn on yourself when the main power goes out, ensuring your lights stay on until the main power is fixed.

┌───────────────┐       ┌───────────────┐
│ Primary Redis │──────▶│ Clients       │
└──────┬────────┘       └───────────────┘
       │
       │ Replication
       ▼
┌───────────────┐
│ Replica Redis │
└───────────────┘

Manual failover steps:
1. Detect primary failure
2. Promote replica to primary
3. Redirect clients to new primary
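
The three steps above can be sketched as a small driver that enforces their order. This is only an illustration under stated assumptions: the four callables are hypothetical hooks supplied by the operator (for example thin wrappers around redis-cli or a DNS update), and none of them is part of Redis itself.

```python
from typing import Callable


def manual_failover(
    primary_is_down: Callable[[], bool],
    promote_replica: Callable[[], None],
    replica_is_primary: Callable[[], bool],
    redirect_clients: Callable[[], None],
) -> bool:
    """Run the three manual failover steps in order.

    Returns True if clients were redirected, False if the failover was
    abandoned. All four callables are hypothetical operator-supplied hooks.
    """
    # Step 1: proceed only if the primary is really considered down.
    if not primary_is_down():
        return False

    # Step 2: promote the replica, then confirm it reports the primary role.
    promote_replica()
    if not replica_is_primary():
        return False

    # Step 3: redirect clients only after the promotion is confirmed.
    redirect_clients()
    return True
```

Passing the checks in as functions keeps the ordering logic separate from how each step is actually carried out in a given environment.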

Build-Up - 6 Steps

1

Foundation: Understanding Redis Primary and Replica

Concept: Learn the roles of primary and replica servers in Redis and how data is copied.

Redis uses a primary server to handle all writes, while replicas copy data from the primary asynchronously. Replicas keep a copy of the data to help with read scaling and redundancy. If the primary fails, a replica can be promoted to take over and keep data available.

Result

You know the difference between primary and replica and why replicas exist.

Understanding these roles is essential because failover means switching these roles manually.
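
A server reports its current role in the Replication section of its INFO output, which is a list of key:value lines. The sketch below parses such text and reads the role. The sample text and its addresses are illustrative, not captured from a real server. Note that INFO still reports a replica's role as "slave".

```python
from typing import Dict


def parse_info(text: str) -> Dict[str, str]:
    """Parse the 'key:value' lines of Redis INFO output into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as '# Replication'.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


# Illustrative output, shaped like what a replica returns for
# 'INFO replication'. The host and port are made-up values.
SAMPLE_REPLICA_INFO = (
    "# Replication\r\n"
    "role:slave\r\n"
    "master_host:10.0.0.1\r\n"
    "master_port:6379\r\n"
    "master_link_status:up\r\n"
    "slave_read_only:1\r\n"
)

info = parse_info(SAMPLE_REPLICA_INFO)
print(info["role"])                # slave
print(info["master_link_status"])  # up
```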

2

Foundation: Detecting Primary Server Failure

3

Intermediate: Promoting Replica to Primary Manually

4

Intermediate: Redirecting Clients to New Primary

5

Advanced: Reconfiguring Old Primary After Recovery

6

Expert: Handling Data Consistency and Split-Brain Risks

Under the Hood

Redis replication works by the primary sending a stream of write commands to replicas to keep data in sync. Replication is asynchronous, so a replica can be slightly behind the primary at the moment of failure. When a replica is promoted, it stops replicating from the old primary and starts accepting writes. Clients must then connect to the new primary to continue operations. In the manual process, an operator issues the commands that change roles and updates the clients.
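
How far a replica trails the primary can be estimated from the replication offsets that INFO reports: master_repl_offset on the primary and slave_repl_offset on the replica, both counted in bytes of the replication stream. The sketch below assumes the operator has a recent INFO sample from the primary, for example one saved by monitoring before the failure. The sample values are made up.

```python
def info_field(text: str, name: str) -> str:
    """Return the value of one 'key:value' field from INFO output."""
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == name:
            return value
    raise KeyError(name)


def replication_gap(primary_info: str, replica_info: str) -> int:
    """Bytes of the replication stream the replica has not yet processed."""
    primary_offset = int(info_field(primary_info, "master_repl_offset"))
    replica_offset = int(info_field(replica_info, "slave_repl_offset"))
    return primary_offset - replica_offset


# Illustrative samples; the offsets are invented for the example.
PRIMARY_INFO = "# Replication\r\nrole:master\r\nmaster_repl_offset:1500\r\n"
REPLICA_INFO = "# Replication\r\nrole:slave\r\nslave_repl_offset:1420\r\n"

print(replication_gap(PRIMARY_INFO, REPLICA_INFO))  # 80
```

A gap of zero suggests the replica had caught up with that sample. A positive gap is a rough indication of how much written data could be missing after promotion.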

Why designed this way?

Manual failover exists because automatic failover tools may not be available or desired in some setups. It gives full control to operators to decide when and how to switch roles, avoiding unexpected changes. Historically, Redis started with simple replication and manual failover before tools like Sentinel were created.

┌───────────────┐          ┌───────────────┐
│ Old Primary   │          │ Replica       │
│ (Failed)      │          │ (Promoted)    │
└──────┬────────┘          └──────┬────────┘
       │ Replication stops           │ Accepts writes
       │                            ▼
       │                     ┌───────────────┐
       │                     │ Clients       │
       │                     └───────────────┘
       ▼
┌───────────────┐
│ Reconfigured  │
│ as Replica    │
└───────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Do you think clients automatically switch to the new primary after manual failover? Commit yes or no.

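Common Belief: Clients automatically detect and connect to the new primary after failover.

Reality: With plain replication, clients do not discover the new primary. An operator must update client connection settings or DNS to point to it.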

Quick: Do you think promoting a replica requires restarting Redis? Commit yes or no.

Common Belief: You must restart the replica server to promote it to primary.

Reality: No restart is needed. Running 'SLAVEOF NO ONE' on the live replica promotes it in place and keeps its dataset.

Quick: Do you think manual failover guarantees no data loss? Commit yes or no.

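Common Belief: Manual failover always preserves all data without loss.

Reality: Replication is asynchronous, so writes the primary accepted but had not yet sent to the replica are lost when that replica is promoted.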

Quick: Do you think the old primary automatically becomes a replica after failover? Commit yes or no.

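Common Belief: The old primary automatically switches to replica mode after failover.

Reality: The old primary comes back still configured as a primary. An operator must run 'SLAVEOF' on it with the new primary's host and port to make it a replica.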

Expert Zone

1

Manual failover requires precise timing to avoid split-brain, which is often overlooked by beginners.

2

Network partitions can cause false failure detection, making manual failover risky without proper checks.

3

Reconfiguring clients can be complex in distributed systems and often requires orchestration tools.
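
One way to reduce the false detections mentioned in item 2 is to declare the primary down only after several consecutive failed probes. The sketch below is a minimal version of that idea. It assumes the server needs no authentication, and the function names, attempt count, and delays are illustrative choices rather than Redis defaults.

```python
import socket
import time
from typing import Callable


def ping(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if the server answers an inline PING with +PONG.

    Assumes no authentication is required; with AUTH enabled the reply
    would be an error rather than +PONG.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            return conn.recv(64).startswith(b"+PONG")
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def looks_down(
    probe: Callable[[], bool], attempts: int = 3, delay: float = 1.0
) -> bool:
    """Report failure only if every one of `attempts` probes fails."""
    for attempt in range(attempts):
        if probe():
            return False
        # Wait between probes, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return True
```

A call such as looks_down(lambda: ping("10.0.0.1", 6379)) would use it, with the address being a placeholder. A probe from a single machine cannot distinguish a dead primary from a network partition, so repeating the check from a second host before promoting is a sensible extra precaution.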

When NOT to use

Manual failover is not suitable for large-scale or highly available systems where downtime must be minimal. Instead, use Redis Sentinel or Redis Cluster for automatic failover and monitoring.

Production Patterns

In production, manual failover is often used as a last resort or in simple setups. Operators script the process with automation tools and combine it with monitoring alerts to reduce human error and downtime.

Connections

Distributed Systems Consensus

Manual failover relates to consensus by requiring agreement on which node is primary to avoid conflicts.

Understanding consensus algorithms like Raft or Paxos helps grasp why failover coordination is critical to prevent split-brain.

Load Balancing

Failover involves redirecting clients similar to how load balancers distribute traffic among servers.

Knowing load balancing concepts clarifies how client redirection after failover maintains service availability.

Emergency Power Systems

Manual failover is like switching to a backup generator during power failure.

Recognizing this connection highlights the importance of readiness and manual control in critical system recovery.

Common Pitfalls

#1 Failing to promote the replica before redirecting clients.

Wrong approach: Update client configs to the replica's IP before running 'SLAVEOF NO ONE' on the replica.

Correct approach: First run 'SLAVEOF NO ONE' ('REPLICAOF NO ONE' since Redis 5) on the replica to promote it, then update client configs.

Root cause: Getting the order wrong sends clients to a replica that is still read-only, so their writes fail.

#2 Not reconfiguring the old primary after recovery.

Wrong approach: Leave the old primary running as primary after failover without changes.

Correct approach: Run 'SLAVEOF <new-primary-host> <new-primary-port>' on the old primary to make it a replica of the new primary.

Root cause: Assuming the old primary switches roles automatically leaves two primaries accepting writes, which causes data conflicts and split-brain.
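
Both role changes are ordinary commands sent over the Redis protocol (RESP), in which a command is an array of bulk strings. The sketch below encodes the promotion and re-attachment commands and sends them over a plain socket. It uses REPLICAOF, the name SLAVEOF has had since Redis 5. The helper names and the address are illustrative, and authentication is assumed to be disabled.

```python
import socket


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def send_command(host: str, port: int, *args: str, timeout: float = 2.0) -> bytes:
    """Send one command and return the first chunk of the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_command(*args))
        return conn.recv(1024)


# Promotion, to be sent to the replica that takes over.
PROMOTE = encode_command("REPLICAOF", "NO", "ONE")

# Re-attachment, to be sent to the recovered old primary.
# 10.0.0.2:6379 is a placeholder for the new primary's address.
REATTACH = encode_command("REPLICAOF", "10.0.0.2", "6379")
```

In practice the same commands are usually typed into redis-cli; the encoding is shown only to make clear that no restart or configuration file edit is involved.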

#3 Assuming clients automatically reconnect to new primary.

Wrong approach: Do nothing to client configs after failover.

Correct approach: Manually update client connection settings or DNS to point to new primary.

Root cause: Not knowing clients do not auto-discover new primary leads to prolonged downtime.
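
One possible way to make the manual redirect a single action is to have clients read the primary's address from one shared settings file that the operator rewrites during failover. This layout is an assumption made for illustration only: the file format and the function names are hypothetical, and a DNS record or configuration service would serve the same purpose.

```python
import json
import os
import tempfile
from typing import Tuple


def write_primary_endpoint(path: str, host: str, port: int) -> None:
    """Atomically replace the endpoint file so readers never see a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file readable by its owner only; adjust the
    # permissions if clients run as a different user.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump({"host": host, "port": port}, tmp)
    # os.replace swaps the new file into place in one step.
    os.replace(tmp_path, path)


def read_primary_endpoint(path: str) -> Tuple[str, int]:
    """Return the (host, port) that clients should connect to."""
    with open(path) as f:
        data = json.load(f)
    return data["host"], data["port"]
```

Clients still need to re-read the file and reconnect after the change; the file only removes the need to edit each client's settings separately.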

Key Takeaways

Manual failover in Redis is a human-driven process to switch roles between primary and replica servers to maintain availability.

Detecting primary failure quickly and promoting a replica with the 'SLAVEOF NO ONE' command ('REPLICAOF NO ONE' since Redis 5) are key steps.

Clients must be manually redirected to the new primary to avoid connection errors.

Careful coordination is needed to avoid data loss and split-brain scenarios during failover.

Manual failover is useful for simple setups but has limits; automatic tools like Sentinel are better suited to production systems that need minimal downtime.