Overview - Active-passive vs active-active

What is it?

Active-passive and active-active are two ways to set up systems for high availability and fault tolerance. In active-passive, one system handles all the work while a standby waits to take over if the active one fails. In active-active, multiple systems run at the same time, sharing the workload and backing each other up. Both setups keep a service running when individual components fail.

Why it matters

Without these setups, a single system failure can stop the service, causing downtime and unhappy users. Active-passive keeps a backup ready, but the standby's capacity sits mostly unused while it waits. Active-active puts all capacity to work and can improve performance, but it is more complex to operate and to keep consistent. Choosing the right setup affects reliability, cost, and user experience.

Where it fits

Learners should understand basic distributed systems and fault tolerance concepts before this. After this, they can explore specific Kafka configurations for replication and failover, and advanced topics like multi-region Kafka clusters and disaster recovery.

Mental Model

Core Idea

Active-passive uses one system at a time with a standby backup, while active-active runs multiple systems together sharing work and backup.

Think of it like...

It's like having one driver and a backup driver waiting in the car (active-passive) versus having two drivers driving side by side, both steering and ready to cover for each other instantly (active-active).

Active-Passive Setup:

┌───────────────┐       ┌───────────────┐
│ Active System │──────▶│ Handles Work  │
└───────────────┘       └───────────────┘
        │
        │ Failover
        ▼
┌───────────────┐       ┌───────────────┐
│ Passive System│       │ Standby Ready │
└───────────────┘       └───────────────┘


Active-Active Setup:

┌───────────────┐   ┌───────────────┐
│ Active System │ ◀▶│ Active System │
│      #1       │   │      #2       │
└───────────────┘   └───────────────┘
       │                  │
       └─────▶ Shared Workload ◀─────┘
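
The two diagrams above can be sketched in a few lines of Python. This is a toy model for illustration only: the Node, ActivePassive, and ActiveActive classes and the round-robin policy are assumptions made for this sketch, not part of any real clustering library.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    handled: list = field(default_factory=list)

    def handle(self, request: str) -> str:
        self.handled.append(request)
        return f"{self.name} handled {request}"


class ActivePassive:
    """One node serves everything; the standby takes over only on failure."""

    def __init__(self, active: Node, standby: Node):
        self.active = active
        self.standby = standby

    def route(self, request: str) -> str:
        if not self.active.healthy:
            # Failover: promote the standby and demote the failed node.
            self.active, self.standby = self.standby, self.active
        if not self.active.healthy:
            raise RuntimeError("no healthy node available")
        return self.active.handle(request)


class ActiveActive:
    """All healthy nodes serve requests, chosen in round-robin order."""

    def __init__(self, nodes: list):
        self.nodes = nodes
        self._next = 0

    def route(self, request: str) -> str:
        # Try each node at most once, skipping unhealthy ones.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node.healthy:
                return node.handle(request)
        raise RuntimeError("no healthy node available")
```

In the active-passive model the standby's handled list stays empty until the active node is marked unhealthy; in the active-active model both nodes accumulate requests from the start.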

Build-Up - 7 Steps

1

Foundation: Understanding system availability basics

Concept: Introduce what availability means and why systems need backups.

Availability means a system is ready and working when users need it. Systems can fail due to hardware, software, or network issues. To avoid downtime, backups or duplicates are used to take over if the main system fails.

Result

Learners understand why systems need to be designed to handle failures without stopping service.

Knowing why availability matters helps appreciate why active-passive and active-active setups exist.
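
To make the value of a backup concrete, here is a rough back-of-the-envelope calculation. It assumes node failures are independent and that takeover is instantaneous, which real systems only approximate; the function names are illustrative.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent nodes is up."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24


one_node = combined_availability(0.99, 1)   # a single 99%-available node
two_nodes = combined_availability(0.99, 2)  # the same node plus one backup
print(f"{downtime_hours_per_year(one_node):.1f} h/year with one node")
print(f"{downtime_hours_per_year(two_nodes):.2f} h/year with two nodes")
```

Under these assumptions, adding one backup to a 99% available node raises the combined availability to about 99.99%. Correlated failures and failover delays reduce that gain in practice.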

2

Foundation: Introducing active-passive setup

3

Intermediate: Exploring active-active setup

4

Intermediate: Failover mechanisms in active-passive

5

Intermediate: Data consistency challenges in active-active

6

Advanced: Kafka's approach to active-passive and active-active

7

Expert: Surprises and pitfalls in active-active Kafka clusters

Under the Hood

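Kafka divides each topic's data into partitions. Each partition has one leader broker that handles all writes and, by default, all reads (active). Followers replicate the leader's data and, by default, do not serve clients (passive); since Kafka 2.4, consumers can optionally be configured to fetch from followers. If the leader fails, Kafka elects a new leader from the followers that are in sync. This leader election is coordinated by the cluster controller, backed by ZooKeeper in older deployments or by Kafka's own Raft-based quorum (KRaft) in newer ones. Active-active setups involve multiple Kafka clusters replicating data asynchronously or synchronously, requiring conflict resolution and careful coordination.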

Why designed this way?

Kafka's active-passive leader-follower model balances simplicity, performance, and fault tolerance. It avoids complex consensus on every write, improving speed. Active-active setups are more complex and were designed later to support multi-region and disaster recovery needs. The tradeoff is between simplicity and availability/performance at scale.

Kafka Partition Replication:

┌───────────────┐
│   Leader      │  ← Active: handles client requests
│   Broker      │
└───────────────┘
       │ Replicates
       ▼
┌───────────────┐
│   Follower    │  ← Passive: replicates data, standby leader
│   Broker      │
└───────────────┘

Leader Election Flow:

┌───────────────┐
│ Detect Failure│
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Elect New     │
│ Leader Broker │
└───────────────┘
       │
       ▼
┌───────────────┐
│ Resume Active │
│ Operations    │
└───────────────┘
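
The election flow above can be sketched as a small state change on one partition. This is a simplified model, not Kafka's controller code: the Partition class and its fields are hypothetical, and the rule used here (promote the first assigned replica that is still in sync) is a simplification of what a real cluster does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Partition:
    leader: Optional[str]
    replicas: list   # all assigned brokers, in preference order
    in_sync: set     # brokers currently caught up with the leader

    def on_broker_failure(self, broker: str) -> None:
        # Step 1: detect the failure and drop the broker from the in-sync set.
        self.in_sync.discard(broker)
        if broker != self.leader:
            return  # a follower failed; the leader keeps serving
        # Step 2: elect the first assigned replica that is still in sync.
        for candidate in self.replicas:
            if candidate in self.in_sync:
                self.leader = candidate
                return  # Step 3: operations resume on the new leader
        # No in-sync replica is left, so the partition goes offline.
        self.leader = None


p = Partition(leader="b1", replicas=["b1", "b2", "b3"], in_sync={"b1", "b3"})
p.on_broker_failure("b1")
print(p.leader)  # b3, because b2 was not in sync
```

Restricting candidates to in-sync replicas is what lets the new leader take over without losing acknowledged writes in this model.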

Myth Busters - 4 Common Misconceptions

Quick: Does active-active always mean zero downtime? Commit yes or no.

Common Belief: Active-active setups guarantee zero downtime and no data loss.

Quick: Is the passive system in active-passive always idle? Commit yes or no.

Common Belief: The passive system in active-passive setups does nothing until failover.

Quick: Does Kafka use active-active by default? Commit yes or no.

Common Belief: Kafka clusters are active-active by default, handling writes on multiple brokers simultaneously.

Quick: Can active-passive setups scale performance easily? Commit yes or no.

Common Belief: Active-passive setups can scale performance by adding more passive systems.

Expert Zone

1

In active-active Kafka clusters, network partitions can cause split-brain scenarios that require careful quorum and leader election tuning to avoid data loss.

2

Active-passive failover timing is critical: failing over too quickly risks reacting to false positives, while failing over too slowly prolongs downtime, so the sensitivity of failure detection must be balanced.

3

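Kafka's ISR (in-sync replicas) mechanism protects data durability, but it trades off availability: if replicas lag or fail and the ISR shrinks below min.insync.replicas, producers using acks=all have their writes rejected until enough replicas catch up.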
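
The failover timing tradeoff in point 2 can be illustrated with a minimal heartbeat-based failure detector. The HeartbeatDetector class and its parameters are hypothetical names chosen for this sketch; real monitoring tools expose similar knobs under their own names.

```python
class HeartbeatDetector:
    """Declares a node failed after too many missed heartbeat intervals."""

    def __init__(self, interval_s: float, missed_allowed: int):
        self.interval_s = interval_s          # expected time between heartbeats
        self.missed_allowed = missed_allowed  # tolerance before declaring failure
        self.last_seen = None

    def heartbeat(self, now: float) -> None:
        self.last_seen = now

    def is_failed(self, now: float) -> bool:
        if self.last_seen is None:
            return False  # nothing received yet, so nothing to judge
        return (now - self.last_seen) > self.interval_s * self.missed_allowed


sensitive = HeartbeatDetector(interval_s=1.0, missed_allowed=1)
tolerant = HeartbeatDetector(interval_s=1.0, missed_allowed=5)
for detector in (sensitive, tolerant):
    detector.heartbeat(now=0.0)

# A 2.5-second network hiccup: only the sensitive detector triggers failover.
print(sensitive.is_failed(now=2.5), tolerant.is_failed(now=2.5))
```

The sensitive setting fails over on a brief hiccup that the node might have survived, while the tolerant setting waits more than five seconds before reacting to a real outage.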

When NOT to use

Active-passive is a poor fit when low latency and high throughput are critical, because the standby's capacity does no useful work until failover. Active-active is not recommended for simple setups, or when strict data consistency is required and complex conflict resolution is not an option. Alternatives include sharding, load balancing, or cloud-managed multi-region services.

Production Patterns

Kafka commonly uses active-passive within a single cluster with leader-follower replication for partitions. Multi-region active-active setups use MirrorMaker or Confluent Replicator to asynchronously replicate data between clusters, balancing latency and consistency. Operators tune leader election, ISR settings, and monitoring to optimize failover and availability.
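
As an example of the ISR tuning mentioned above, the following sketch models how a partition leader decides whether to accept a write. The acks and min.insync.replicas settings are real Kafka configuration names, but the function itself is a simplified illustration and not broker code.

```python
def can_accept_write(isr_size: int, min_insync_replicas: int, acks: str) -> bool:
    """Simplified model of whether a partition leader accepts a produce request."""
    if isr_size < 1:
        return False  # no leader available, the partition is offline
    if acks == "all":
        # With acks=all the write is rejected when too few replicas are in sync.
        return isr_size >= min_insync_replicas
    # With acks=0 or acks=1 only the leader is needed.
    return True


# Replication factor 3 with min.insync.replicas=2 tolerates one lagging replica.
print(can_accept_write(isr_size=2, min_insync_replicas=2, acks="all"))  # True
print(can_accept_write(isr_size=1, min_insync_replicas=2, acks="all"))  # False
```

A higher min.insync.replicas strengthens durability for acks=all producers but means fewer replica failures can be tolerated before writes start failing.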

Connections

Distributed Consensus Algorithms

Strongly consistent active-active systems often rely on consensus algorithms like Raft or Paxos to agree on the order of writes across nodes.

Understanding consensus helps grasp how active-active systems coordinate writes and avoid conflicts.

Load Balancing

Active-active systems share workload the way load balancers distribute traffic across servers.

Knowing load balancing principles clarifies how active-active improves performance and availability.

Human Teamwork Dynamics

Active-passive and active-active mirror how teams work: one leader with backup versus multiple leaders collaborating.

Recognizing this helps understand coordination challenges and failover in technical systems.

Common Pitfalls

#1 Failing to configure automatic failover in active-passive setups.

Wrong approach: Manual failover only: an operator must detect the failure and switch systems by hand.

Correct approach: Set up monitoring and automatic failover tools that detect failure and switch over quickly.

Root cause: Underestimating downtime impact and overestimating manual response speed.

#2 Assuming active-active Kafka clusters do not need conflict resolution.

Wrong approach: Deploy multi-region Kafka clusters without configuring idempotent producers or quorum settings.

Correct approach: Use idempotent producers, configure quorum-based writes (for example, acks=all together with min.insync.replicas), and monitor for split-brain scenarios.

Root cause: Misunderstanding data consistency challenges in distributed active-active systems.

#3 Using active-passive to scale performance by adding passive nodes.

Wrong approach: Add multiple passive brokers expecting them to share workload.

Correct approach: Use active-active or partitioning to distribute workload across active brokers.

Root cause: Confusing failover backup with load distribution.
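
The partitioning approach from pitfall #3 can be sketched as follows. This is an illustrative model only: it uses CRC32 as a stand-in hash (Kafka's default partitioner uses a different hash function), and the round-robin leader placement and broker names are assumptions made for the example.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def leader_for(partition: int, brokers: list) -> str:
    """Assumed round-robin placement of partition leaders across brokers."""
    return brokers[partition % len(brokers)]


brokers = ["broker-1", "broker-2", "broker-3"]  # hypothetical broker names
num_partitions = 6

# Every broker leads some partitions, so every broker does active work.
leaders = Counter(leader_for(p, brokers) for p in range(num_partitions))
print(leaders)

# The same key always lands on the same partition, and therefore the same leader.
p = partition_for("order-42", num_partitions)
print(p, leader_for(p, brokers))
```

Each broker is the active leader for some partitions and a passive follower for others, which is how partitioning spreads load without adding idle standbys.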

Key Takeaways

Active-passive setups use one active system with a standby backup, providing simple failover but limited performance scaling.

Active-active setups run multiple systems simultaneously, sharing workload and backup duties, improving performance and availability but increasing complexity.

Kafka uses an active-passive leader-follower model per partition by default, with active-active requiring special multi-cluster configurations.

Failover mechanisms and data consistency are critical challenges that differ between active-passive and active-active setups.

Choosing between active-passive and active-active depends on tradeoffs among complexity, performance, availability, and consistency needs.