Overview - In-sync replicas (ISR)

What is it?

In Kafka, the in-sync replicas (ISR) of a partition are the replicas, leader included, that are caught up with the leader's log. A follower counts as caught up if it has reached the leader's log end within the last replica.lag.time.max.ms. ISR ensures data durability and availability by keeping multiple copies of data synchronized: a message counts as committed only once every ISR member has it.

Why it matters

ISR exists to prevent data loss and maintain high availability in Kafka clusters. Without ISR, if a leader fails, followers might not have the latest data, causing message loss or inconsistent reads. This would make Kafka unreliable for critical data streaming applications.

Where it fits

Before learning about ISR, you should understand Kafka basics like partitions, leaders, and replicas. After ISR, you can explore Kafka's replication protocols, leader election, and fault tolerance mechanisms.

Mental Model

Core Idea

ISR is the group of replicas that have caught up with the leader and can safely take over without data loss.

Think of it like...

ISR is like a team of backup singers who always stay in perfect harmony with the lead singer, ready to step in without missing a beat if the lead stops singing.

Partition Leader
  │
  ├─ Replica 1 (In-sync)
  ├─ Replica 2 (In-sync)
  └─ Replica 3 (Out-of-sync)

ISR = {Leader, Replica 1, Replica 2}
The leader always counts as in sync with itself. Out-of-sync replicas are excluded from ISR until they catch up.
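
The picture above can be written down as a small data model. This is a toy sketch in Python, not Kafka's own code: the class name `PartitionState`, its fields, and the broker ids are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionState:
    """Toy model of one partition's replica set (all names are illustrative)."""
    leader: int                                   # broker id of the current leader
    replicas: list[int]                           # every assigned replica, leader included
    isr: set[int] = field(default_factory=set)    # replicas currently in sync

    def out_of_sync(self) -> set[int]:
        # Assigned replicas that are not currently in the ISR.
        return set(self.replicas) - self.isr


# Broker 0 leads; brokers 1 and 2 are in sync, broker 3 has fallen behind.
state = PartitionState(leader=0, replicas=[0, 1, 2, 3], isr={0, 1, 2})
print(state.out_of_sync())  # {3}
```

The ISR is kept as a set separate from the assignment list because membership changes over time while the assignment stays fixed.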

Build-Up - 7 Steps

1

Foundation: Kafka Replication Basics

Concept: Introduce Kafka partitions, leaders, and replicas.

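Kafka stores data in partitions. Each partition has several replicas: one acts as the leader and the rest are followers. By default the leader handles all reads and writes (since Kafka 2.4, consumers can optionally fetch from followers). Followers copy data from the leader to provide fault tolerance.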

Result

Learner understands the roles of leader and replicas in Kafka partitions.

Understanding leader and replicas is essential to grasp why synchronization matters for data safety.
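
To make the copying concrete, here is a minimal sketch of a follower pulling records from a leader. It assumes logs are plain Python lists and that the follower asks for data starting at its own log end; the function name `fetch` and the `max_records` parameter are hypothetical, and real Kafka fetches are network requests with far more detail.

```python
def fetch(leader_log: list[str], follower_log: list[str], max_records: int = 2) -> int:
    """Copy up to max_records from the leader, starting at the follower's log end.

    Returns the number of records copied (0 means the follower is caught up).
    """
    start = len(follower_log)
    batch = leader_log[start:start + max_records]
    follower_log.extend(batch)
    return len(batch)


leader_log = ["m0", "m1", "m2"]
follower_log: list[str] = []

# Keep fetching until a fetch returns nothing new.
while fetch(leader_log, follower_log):
    pass

print(follower_log == leader_log)  # True
```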

2

Foundation: What Does Synchronization Mean?

3

Intermediate: Defining the ISR Set

4

Intermediate: ISR and Leader Election

5

Intermediate: ISR Shrinking and Expansion

6

Advanced: ISR Impact on Acknowledgments

7

Expert: ISR and Partition Reassignment Surprises

Under the Hood

Kafka tracks the log end offset each replica has reached. The leader maintains a high watermark: the offset up to which every ISR member has replicated the log. Followers replicate by sending fetch requests to the leader; there is no separate heartbeat between them. If a follower stops fetching, or has not caught up to the leader's log end within replica.lag.time.max.ms, the leader removes it from ISR. This mechanism ensures only replicas with up-to-date data are trusted.
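
The two rules in this paragraph can be sketched in a few lines. This is a simplified model, not the broker's implementation: the function names, the dictionaries, and the timestamps are assumptions made for the example, and `max_lag_ms` stands in for replica.lag.time.max.ms.

```python
def high_watermark(log_end_offsets: dict[int, int], isr: set[int]) -> int:
    """Smallest log end offset among ISR members: everything below it is on all of them."""
    return min(log_end_offsets[r] for r in isr)


def shrink_isr(isr: set[int], leader: int, last_caught_up_ms: dict[int, int],
               now_ms: int, max_lag_ms: int) -> set[int]:
    """Keep the leader plus every follower that caught up within max_lag_ms."""
    return {
        r for r in isr
        if r == leader or now_ms - last_caught_up_ms[r] <= max_lag_ms
    }


# Hypothetical state: broker 0 leads, broker 3 last caught up 40 seconds ago.
log_end_offsets = {0: 120, 1: 120, 2: 118, 3: 40}
last_caught_up_ms = {1: 99_000, 2: 95_000, 3: 60_000}

isr = shrink_isr({0, 1, 2, 3}, leader=0, last_caught_up_ms=last_caught_up_ms,
                 now_ms=100_000, max_lag_ms=30_000)
print(sorted(isr))                           # [0, 1, 2]
print(high_watermark(log_end_offsets, isr))  # 118
```

Note that membership is decided by time since the follower last caught up, not by how many messages it is behind, which is why broker 2 stays in the set while slightly behind.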

Why designed this way?

Kafka was designed for high throughput and fault tolerance. Using ISR allows Kafka to balance data safety and availability by only trusting replicas that are fully caught up. Alternatives like waiting for all replicas would reduce availability, while ignoring replica lag would risk data loss.

┌─────────────┐      ┌───────────────┐
│   Leader    │─────▶│ Replica 1     │
│ (Partition) │      │ (In-sync)     │
└─────────────┘      └───────────────┘
       │
       │             ┌───────────────┐
       ├────────────▶│ Replica 2     │
       │             │ (In-sync)     │
       │             └───────────────┘
       │
       │             ┌───────────────┐
       └────────────▶│ Replica 3     │
                     │ (Out-of-sync) │
                     └───────────────┘

ISR = {Leader, Replica 1, Replica 2}
The leader tracks each follower's fetch progress to maintain ISR.

Myth Busters - 4 Common Misconceptions

Quick: Do you think all replicas in Kafka are always in-sync? Commit yes or no.

Common Belief: All replicas are always in-sync with the leader.

Quick: Can a replica outside ISR become leader during failover? Commit yes or no.

Common Belief: Any replica can become leader regardless of sync status.

Quick: Does ISR size always stay constant? Commit yes or no.

Common Belief: ISR size is fixed and does not change dynamically.

Quick: Does ISR guarantee zero data loss in all failure cases? Commit yes or no.

Common Belief: ISR guarantees zero data loss no matter what.

Expert Zone

1

ISR membership depends on how recently each follower caught up to the leader (bounded by replica.lag.time.max.ms), so slow fetches or network issues can cause temporary ISR shrinkage.

2

The leader's high watermark advances only when all ISR replicas acknowledge, affecting consumer visibility of messages.

3

During partition reassignment, newly assigned replicas start outside ISR and must catch up before joining it, which can temporarily reduce fault tolerance.
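
The link between ISR membership and what consumers can see can be illustrated with a toy model. It assumes the leader's log is a Python list and that consumers may read only below the high watermark; the function name and arguments are hypothetical.

```python
def visible_to_consumers(leader_log: list[str], follower_log_ends: dict[int, int],
                         isr_followers: set[int]) -> list[str]:
    """Records below the high watermark (min log end over the leader and ISR followers)."""
    ends = [len(leader_log)] + [follower_log_ends[f] for f in isr_followers]
    return leader_log[:min(ends)]


leader_log = ["a", "b", "c", "d", "e"]
follower_log_ends = {1: 5, 2: 3}   # follower 2 is two records behind

# While follower 2 is in ISR, it holds the high watermark back.
print(visible_to_consumers(leader_log, follower_log_ends, {1, 2}))  # ['a', 'b', 'c']

# Once follower 2 is dropped from ISR, the high watermark can advance.
print(visible_to_consumers(leader_log, follower_log_ends, {1}))     # ['a', 'b', 'c', 'd', 'e']
```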

When NOT to use

ISR offers little protection when a topic has a replication factor of 1, since ISR then contains only the leader. With unclean leader election enabled, an out-of-sync replica may become leader, which sacrifices data safety for availability. In such cases, alternatives like external replication or backup systems should be used.
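
The effect of unclean leader election can be sketched as a selection rule. This is a simplified illustration under the assumption that the first live replica in assignment order is preferred; `elect_leader` and its parameters are invented for the example and do not mirror the controller's actual code.

```python
from typing import Optional


def elect_leader(replicas: list[int], isr: set[int], alive: set[int],
                 allow_unclean: bool = False) -> Optional[int]:
    """Pick a new leader, or return None if the partition must stay offline."""
    # Clean election: first live replica, in assignment order, that is in ISR.
    for r in replicas:
        if r in alive and r in isr:
            return r
    # Unclean election: fall back to any live replica, even one that is behind.
    if allow_unclean:
        for r in replicas:
            if r in alive:
                return r
    return None


# Brokers 0 and 1 (the whole ISR) are down; only the lagging broker 2 is alive.
print(elect_leader([0, 1, 2], isr={0, 1}, alive={2}))                      # None
print(elect_leader([0, 1, 2], isr={0, 1}, alive={2}, allow_unclean=True))  # 2
```

The first call keeps the partition unavailable but safe; the second restores availability at the cost of whatever broker 2 had not yet replicated.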

Production Patterns

In production, operators monitor ISR size to detect replica lag or failures. They tune replica.lag.time.max.ms and min.insync.replicas to balance durability and availability. During maintenance, careful partition reassignment is done to avoid ISR shrinkage causing downtime.
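
A monitoring check along these lines might look like the following sketch. The status labels echo the ideas behind Kafka's under-replicated and under-min-ISR partition metrics, but the function, the topic names, and the numbers are assumptions for illustration only.

```python
def partition_health(replica_count: int, isr_size: int, min_insync_replicas: int) -> str:
    """Classify a partition by comparing its ISR size with two thresholds."""
    if isr_size < min_insync_replicas:
        return "under-min-isr"       # acks=all writes would be rejected
    if isr_size < replica_count:
        return "under-replicated"    # still writable, but with less redundancy
    return "healthy"


# Hypothetical snapshot: partition name -> (assigned replicas, current ISR size).
snapshot = {"orders-0": (3, 3), "orders-1": (3, 2), "orders-2": (3, 1)}

for name, (replica_count, isr_size) in snapshot.items():
    print(name, partition_health(replica_count, isr_size, min_insync_replicas=2))
```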

Connections

Consensus Algorithms (e.g., Raft, Paxos)

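ISR plays a role similar to quorum sets in consensus algorithms: both define which nodes must hold a write before it counts as committed. The difference is that Raft and Paxos commit on a majority, while Kafka commits once every current ISR member has the write.

Understanding ISR helps grasp how distributed systems achieve consistency and fault tolerance, and how Kafka's approach differs from majority agreement.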
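
One way to compare an ISR with a majority quorum is to count how many replicas each needs in order to tolerate f failures without losing committed data. The sketch below encodes the usual textbook figures (2f + 1 for majority voting, f + 1 for an ISR-style scheme); the function names are made up for the example.

```python
def replicas_for_majority_quorum(failures_tolerated: int) -> int:
    # A majority must survive, so f failures require 2f + 1 replicas.
    return 2 * failures_tolerated + 1


def replicas_for_isr(failures_tolerated: int) -> int:
    # Every ISR member has all committed data, so f failures require f + 1 replicas.
    return failures_tolerated + 1


for f in (1, 2):
    print(f, replicas_for_majority_quorum(f), replicas_for_isr(f))
```

The ISR approach needs fewer copies for the same failure tolerance, but a slow member delays commits until it is removed from the set.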

Database Replication

ISR parallels synchronous replication in databases where replicas must confirm writes before commit.

Knowing ISR clarifies tradeoffs between synchronous and asynchronous replication in data durability.

Team Backup Systems

ISR resembles backup team members who must be fully prepared before taking over a task.

This cross-domain view highlights the importance of readiness and synchronization in reliable failover.

Common Pitfalls

#1 Assuming all replicas are always in ISR and ignoring replica lag.

Wrong approach: Setting min.insync.replicas=3 when only 2 replicas are in ISR and expecting full durability; acks=all writes are rejected until ISR grows back.

Correct approach: Monitor ISR size and set min.insync.replicas lower than the replication factor (a common choice is 2 with a replication factor of 3) so that acks=all writes survive one replica leaving ISR.

Root cause: Misunderstanding that ISR is dynamic and can shrink under load or network issues.
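
The interaction between acks=all and min.insync.replicas can be sketched as a single check. This is a toy model: the local `NotEnoughReplicas` class is only loosely modeled on the error a producer would receive, and `try_append` is a hypothetical name.

```python
class NotEnoughReplicas(Exception):
    """Local stand-in for the error raised when ISR is too small."""


def try_append(acks: str, isr_size: int, min_insync_replicas: int) -> str:
    """Accept or reject a write based on the producer's acks setting and ISR size."""
    if acks == "all" and isr_size < min_insync_replicas:
        raise NotEnoughReplicas(
            f"ISR has {isr_size} members, need at least {min_insync_replicas}"
        )
    return "accepted"


print(try_append("all", isr_size=2, min_insync_replicas=2))  # accepted

try:
    try_append("all", isr_size=2, min_insync_replicas=3)
except NotEnoughReplicas as err:
    print("rejected:", err)
```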

#2 Enabling unclean leader election and ignoring ISR implications.

Wrong approach: Setting unclean.leader.election.enable=true to improve availability without considering data loss.

Correct approach: Keep unclean.leader.election.enable=false to ensure only ISR replicas become leaders, preserving data safety.

Root cause: Lack of awareness that unclean leader election can cause data loss by electing out-of-sync replicas.
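
A rough way to picture the cost of an unclean election is to count committed records the new leader never received. The sketch assumes offsets are simple integers and that everything below the old high watermark was committed; the function name and the numbers are illustrative.

```python
def committed_records_lost(old_high_watermark: int, new_leader_log_end: int) -> int:
    """Committed records missing from the newly elected leader's log."""
    return max(0, old_high_watermark - new_leader_log_end)


# An in-sync replica holds everything up to the high watermark.
print(committed_records_lost(old_high_watermark=118, new_leader_log_end=118))  # 0

# An out-of-sync replica elected uncleanly is missing committed records.
print(committed_records_lost(old_high_watermark=118, new_leader_log_end=40))   # 78
```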

#3 Ignoring ISR changes during partition reassignment.

Wrong approach: Reassigning partitions without monitoring ISR, causing ISR to shrink and availability to drop.

Correct approach: Perform reassignment gradually and monitor ISR to ensure replicas catch up before leader switches.

Root cause: Not understanding that new replicas start empty and must rejoin ISR to maintain fault tolerance.
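
A gradual reassignment script might gate each batch on a check like the one below. This is a hedged sketch of the idea, not a Kafka API: `reassignment_caught_up` and the broker ids are hypothetical.

```python
def reassignment_caught_up(target_replicas: list[int], isr: set[int]) -> bool:
    """True once every replica in the target assignment has joined ISR."""
    return set(target_replicas).issubset(isr)


# Moving a partition from brokers [0, 1, 2] to [0, 1, 3].
target = [0, 1, 3]

print(reassignment_caught_up(target, isr={0, 1, 2}))     # False: broker 3 is still copying
print(reassignment_caught_up(target, isr={0, 1, 2, 3}))  # True: safe to move to the next batch
```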

Key Takeaways

In-sync replicas (ISR) are the set of Kafka replicas fully caught up with the leader, ensuring data safety.

With unclean leader election disabled, only replicas in ISR can become leaders, preventing loss of committed data during failover.

ISR membership changes dynamically based on replica health and synchronization status.

ISR size affects write durability and latency through producer acknowledgment settings.

Understanding ISR behavior during partition reassignment and failures is critical for reliable Kafka operations.