
In-sync replicas (ISR) in Kafka - Deep Dive

Overview - In-sync replicas (ISR)
What is it?
In Kafka, the in-sync replicas (ISR) of a partition are the replicas, including the leader itself, that are caught up with the leader's log. These replicas hold every message the leader has committed. The ISR provides durability and availability by keeping multiple synchronized copies of the data.
Why it matters
ISR exists to prevent data loss and maintain high availability in Kafka clusters. Without ISR, if a leader fails, followers might not have the latest data, causing message loss or inconsistent reads. This would make Kafka unreliable for critical data streaming applications.
Where it fits
Before learning about ISR, you should understand Kafka basics like partitions, leaders, and replicas. After ISR, you can explore Kafka's replication protocols, leader election, and fault tolerance mechanisms.
Mental Model
Core Idea
ISR is the group of replicas that have caught up with the leader and can safely take over without data loss.
Think of it like...
ISR is like a team of backup singers who always stay in perfect harmony with the lead singer, ready to step in without missing a beat if the lead stops singing.
Partition Leader
  │
  ├─ Replica 1 (In-sync)
  ├─ Replica 2 (In-sync)
  └─ Replica 3 (Out-of-sync)

ISR = {Leader, Replica 1, Replica 2}
Out-of-sync replicas are excluded from ISR until they catch up.
Build-Up - 7 Steps
1. Foundation: Kafka Replication Basics
Concept: Introduce Kafka partitions, leaders, and replicas.
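Kafka stores data in partitions. Each partition has one leader replica and, depending on the replication factor, zero or more follower replicas. By default the leader handles all reads and writes (since Kafka 2.4, consumers can optionally fetch from followers). Followers copy data from the leader to provide fault tolerance.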
Result
Learner understands the roles of leader and replicas in Kafka partitions.
Understanding leader and replicas is essential to grasp why synchronization matters for data safety.
2. Foundation: What Does Synchronization Mean?
Concept: Explain what it means for replicas to be synchronized with the leader.
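A replica is synchronized if it has copied all messages the leader has committed. In practice Kafka judges this by time: a follower counts as in sync if it has caught up to the end of the leader's log within replica.lag.time.max.ms. If a replica lags behind for longer than that, it is out of sync and cannot guarantee it has the latest data.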
Result
Learner knows the difference between in-sync and out-of-sync replicas.
Knowing synchronization status helps understand how Kafka decides which replicas can safely serve data.
3. Intermediate: Defining the ISR Set
Before reading on: do you think all replicas are always in the ISR? Commit to yes or no.
Concept: Introduce the ISR as the set of replicas fully caught up with the leader.
Kafka maintains an ISR list per partition. Only replicas that have fully caught up with the leader are in this list. Replicas that fall behind are removed until they catch up again.
Result
Learner understands that ISR is a dynamic set reflecting replica health.
Recognizing ISR as a dynamic group clarifies how Kafka manages data consistency and availability.
4. Intermediate: ISR and Leader Election
Before reading on: do you think Kafka can elect any replica as leader or only those in the ISR? Commit to your answer.
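Concept: Explain that, by default, only replicas in the ISR can become leader during failover.
When a leader fails, Kafka chooses a new leader from the ISR. This ensures the new leader has all committed data, preventing data loss. If unclean.leader.election.enable is set to true, an out-of-sync replica may be elected when no ISR member is available, at the risk of losing data.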
Result
Learner sees how ISR protects data during leader changes.
Knowing leader election depends on ISR helps understand Kafka's fault tolerance guarantees.
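The election rule can be illustrated with a toy model. This is a simplified sketch, not Kafka's controller code: the function `elect_leader` and its parameters are hypothetical names, and the rule it assumes is "pick the first live replica, in assignment order, that is in the ISR".

```python
from typing import Optional, Sequence, Set


def elect_leader(
    assigned: Sequence[int],
    isr: Set[int],
    live: Set[int],
    allow_unclean: bool = False,
) -> Optional[int]:
    """Return the broker id of the new leader, or None if no replica is eligible."""
    # Clean election: first live replica, in assignment order, that is in the ISR.
    for broker in assigned:
        if broker in live and broker in isr:
            return broker
    # Unclean election (models unclean.leader.election.enable=true):
    # fall back to any live replica and accept possible data loss.
    if allow_unclean:
        for broker in assigned:
            if broker in live:
                return broker
    return None


# Broker 1 (the old leader) has failed; brokers 2 and 3 are still alive.
print(elect_leader([1, 2, 3], isr={1, 2}, live={2, 3}))  # 2
# No live ISR member: the partition stays offline under a clean election.
print(elect_leader([1, 2, 3], isr={1}, live={2, 3}))  # None
print(elect_leader([1, 2, 3], isr={1}, live={2, 3}, allow_unclean=True))  # 2
```

In the last call the elected broker was not in the ISR, so any messages it had not yet copied from the old leader would be lost.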
5. Intermediate: ISR Shrinking and Expansion
Concept: Show how replicas join or leave ISR based on their sync status.
If a replica falls behind due to network or load, Kafka removes it from ISR. When it catches up, Kafka adds it back. This keeps ISR accurate and reliable.
Result
Learner understands ISR updates reflect real-time replica health.
Understanding ISR dynamics helps predict Kafka behavior under load or failures.
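A rough way to picture shrinking and expansion is a periodic check on the leader. The sketch below is a toy model under stated assumptions: `FollowerState` and `update_isr` are hypothetical names, only followers are tracked (the leader is always in its own ISR), and the 30-second threshold is an illustrative stand-in for replica.lag.time.max.ms.

```python
from dataclasses import dataclass
from typing import Dict, Set

REPLICA_LAG_TIME_MAX_MS = 30_000  # stands in for replica.lag.time.max.ms (illustrative value)


@dataclass
class FollowerState:
    log_end_offset: int     # next offset this follower would write
    last_caught_up_ms: int  # last time it had fetched up to the leader's log end


def update_isr(
    isr: Set[int],
    followers: Dict[int, FollowerState],
    high_watermark: int,
    now_ms: int,
    max_lag_ms: int = REPLICA_LAG_TIME_MAX_MS,
) -> Set[int]:
    """Return the follower ISR after one shrink/expand pass."""
    new_isr = set(isr)
    for broker, state in followers.items():
        lagging_too_long = now_ms - state.last_caught_up_ms > max_lag_ms
        if broker in new_isr and lagging_too_long:
            new_isr.discard(broker)  # shrink: follower has been behind for too long
        elif broker not in new_isr and state.log_end_offset >= high_watermark:
            new_isr.add(broker)      # expand: follower has caught up again
    return new_isr


followers = {2: FollowerState(100, 95_000), 3: FollowerState(40, 50_000)}
isr = update_isr({2, 3}, followers, high_watermark=100, now_ms=100_000)
print(sorted(isr))  # [2]

followers[3] = FollowerState(100, 100_500)  # broker 3 catches up
isr = update_isr(isr, followers, high_watermark=100, now_ms=101_000)
print(sorted(isr))  # [2, 3]
```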
6. Advanced: ISR Impact on Acknowledgments
Before reading on: do you think Kafka waits for all replicas or only ISR replicas to acknowledge writes? Commit to your answer.
Concept: Explain how ISR affects message acknowledgment and durability.
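The producer's acks setting controls how many acknowledgments a write needs. With acks=all, the leader waits until every replica currently in the ISR has the write before acknowledging it. If the ISR has fewer members than min.insync.replicas, the write is rejected instead. This ensures acknowledged data is safely replicated.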
Result
Learner knows how ISR size affects write durability and latency.
Understanding ISR's role in acknowledgments reveals the tradeoff between safety and performance.
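The interplay between acks=all, the ISR, and min.insync.replicas can be sketched as a small decision function. This is a toy model, not broker code: `acks_all_decision` and its return values are hypothetical, and replica sets are plain sets of broker ids that include the leader.

```python
from typing import Set


def acks_all_decision(isr: Set[int], replicated_to: Set[int], min_insync_replicas: int) -> str:
    """Classify an acks=all write as 'reject', 'ack', or 'wait'."""
    # Too few in-sync replicas: the write is refused up front.
    if len(isr) < min_insync_replicas:
        return "reject"
    # Every current ISR member has the write: safe to acknowledge.
    if isr.issubset(replicated_to):
        return "ack"
    # Otherwise the leader keeps waiting for the remaining ISR members.
    return "wait"


print(acks_all_decision({1, 2, 3}, {1, 2}, 2))  # wait
print(acks_all_decision({1, 2, 3}, {1, 2, 3}, 2))  # ack
print(acks_all_decision({1, 2}, {1, 2}, 2))  # ack
print(acks_all_decision({1}, {1}, 2))  # reject
```

The third call shows why ISR size matters for latency: once replica 3 has dropped out of the ISR, it no longer delays acknowledgments, but the write is also stored on one fewer broker.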
7. Expert: ISR and Partition Reassignment Surprises
Before reading on: do you think ISR is preserved during partition reassignment? Commit to yes or no.
Concept: Discuss how ISR behaves during partition reassignment and broker restarts.
During partition reassignment, new replicas start empty and stay outside the ISR until they catch up, so the partition temporarily has less redundancy than its replica list suggests. Also, if a broker restarts slowly, its replicas may be removed from the ISR, affecting availability.
Result
Learner understands ISR can fluctuate unexpectedly in production.
Knowing ISR behavior during reassignment helps prevent downtime and data loss surprises.
Under the Hood
Kafka tracks the log end offset of each replica. The leader maintains a high watermark: the highest offset that every ISR replica has replicated, which is also the limit of what consumers can read. Followers replicate by sending fetch requests to the leader. If a follower stops fetching, or has not caught up to the end of the leader's log within replica.lag.time.max.ms, the leader removes it from the ISR. This mechanism ensures only replicas with up-to-date data are trusted.
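As a minimal sketch of the high watermark idea, assuming each replica's progress is summarized by a single log end offset, the watermark is the smallest log end offset among ISR members. The function name `high_watermark` and the sample offsets are illustrative.

```python
from typing import Dict, Set


def high_watermark(log_end_offsets: Dict[int, int], isr: Set[int]) -> int:
    """Smallest log end offset among ISR members (the leader included)."""
    return min(log_end_offsets[broker] for broker in isr)


offsets = {1: 120, 2: 118, 3: 90}  # broker 1 is the leader

# Broker 3 is out of sync, so it does not hold the watermark back.
print(high_watermark(offsets, isr={1, 2}))  # 118
# If broker 3 were still in the ISR, the watermark could not pass its offset.
print(high_watermark(offsets, isr={1, 2, 3}))  # 90
```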
Why designed this way?
Kafka was designed for high throughput and fault tolerance. Using ISR allows Kafka to balance data safety and availability by only trusting replicas that are fully caught up. Alternatives like waiting for all replicas would reduce availability, while ignoring replica lag would risk data loss.
┌─────────────┐      ┌───────────────┐
│   Leader    │─────▶│ Replica 1     │
│ (Partition) │      │ (In-sync)     │
└─────────────┘      └───────────────┘
       │
       │             ┌───────────────┐
       ├────────────▶│ Replica 2     │
       │             │ (In-sync)     │
       │             └───────────────┘
       │
       │             ┌───────────────┐
       └────────────▶│ Replica 3     │
                     │ (Out-of-sync) │
                     └───────────────┘

ISR = {Leader, Replica 1, Replica 2}
The leader tracks each follower's fetch progress to maintain the ISR.
Myth Busters - 4 Common Misconceptions
Quick: Do you think all replicas in Kafka are always in-sync? Commit yes or no.
Common Belief: All replicas are always in-sync with the leader.
Reality: Only replicas that have fully caught up are in the ISR; others are excluded until they catch up.
Why it matters: Assuming all replicas are in-sync can lead to overestimating data safety and cause unexpected data loss during failover.
Quick: Can a replica outside ISR become leader during failover? Commit yes or no.
Common Belief: Any replica can become leader regardless of sync status.
Reality: By default, only replicas in the ISR can be elected leader, which prevents loss of committed data. An out-of-sync replica can be elected only if unclean leader election is enabled.
Why it matters: Believing otherwise risks choosing a replica missing data, causing inconsistent reads and data loss.
Quick: Does ISR size always stay constant? Commit yes or no.
Common Belief: ISR size is fixed and does not change dynamically.
Reality: ISR size changes as replicas fall behind or catch up, reflecting real-time health.
Why it matters: Ignoring ISR dynamics can cause misinterpretation of cluster health and lead to wrong operational decisions.
Quick: Does ISR guarantee zero data loss in all failure cases? Commit yes or no.
Common Belief: ISR guarantees zero data loss no matter what.
Reality: ISR reduces data loss risk but cannot guarantee zero loss if misconfigured or under extreme failures.
Why it matters: Overtrusting ISR can cause complacency in backup and monitoring strategies, risking data integrity.
Expert Zone
1. ISR membership depends on how recently each follower caught up to the leader (replica.lag.time.max.ms), so network issues or slow fetches can cause temporary ISR shrinkage.
2. The leader's high watermark advances only when all ISR replicas have replicated a message, and consumers can read only up to the high watermark.
3. During partition reassignment, newly added replicas start outside the ISR and must catch up, which can temporarily reduce fault tolerance.
When NOT to use
ISR offers little protection when a topic has a replication factor of 1 (the ISR then contains only the leader) or when unclean leader election is enabled, which sacrifices data safety for availability. In such cases, alternatives like external replication or backup systems should be used.
Production Patterns
In production, operators monitor ISR size to detect replica lag or failures. They tune replica.lag.time.max.ms and min.insync.replicas to balance durability and availability. During maintenance, careful partition reassignment is done to avoid ISR shrinkage causing downtime.
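One lightweight way to monitor ISR size is to compare the `Replicas` and `Isr` fields printed by `kafka-topics.sh --describe`. The sketch below parses such output; the function name `under_replicated`, the topic name, and the sample lines are illustrative, and the regular expression assumes the usual `Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...` layout, which may carry extra fields in some Kafka versions.

```python
import re
from typing import Dict, List

# Matches the per-partition lines of `kafka-topics.sh --describe` (assumed layout).
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*\S+\s+Replicas:\s*(?P<replicas>[\d,]+)\s+Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_output: str) -> List[Dict[str, object]]:
    """List partitions whose ISR is missing at least one assigned replica."""
    results = []
    for line in describe_output.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # header lines and anything else that does not fit the layout
        replicas = [int(x) for x in match["replicas"].split(",") if x]
        isr = {int(x) for x in match["isr"].split(",") if x}
        missing = [r for r in replicas if r not in isr]
        if missing:
            results.append(
                {"topic": match["topic"], "partition": int(match["partition"]), "out_of_sync": missing}
            )
    return results


# Illustrative sample, not captured from a real cluster.
SAMPLE = (
    "Topic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n"
    "Topic: orders\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2,3\n"
)
print(under_replicated(SAMPLE))
```

For a quick manual check, `kafka-topics.sh --describe --under-replicated-partitions` filters the same information on the command line.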
Connections
Consensus Algorithms (e.g., Raft, Paxos)
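ISR plays a role similar to quorum sets in consensus algorithms, which ensure agreement among nodes before committing data. Unlike a majority quorum, a write is committed only once every current ISR member has it, and the ISR itself can shrink or grow.
Understanding ISR helps grasp how distributed systems achieve consistency and fault tolerance by agreeing on which nodes are trusted.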
Database Replication
ISR parallels synchronous replication in databases where replicas must confirm writes before commit.
Knowing ISR clarifies tradeoffs between synchronous and asynchronous replication in data durability.
Team Backup Systems
ISR resembles backup team members who must be fully prepared before taking over a task.
This cross-domain view highlights the importance of readiness and synchronization in reliable failover.
Common Pitfalls
#1 Assuming all replicas are always in the ISR and ignoring replica lag.
Wrong approach: Setting min.insync.replicas=3 while only 2 replicas are in the ISR, and expecting acks=all writes to succeed with full durability.
Correct approach: Monitor ISR size and choose min.insync.replicas so that it stays at or below the ISR size you expect under normal failures (commonly 2 with a replication factor of 3) to avoid write failures.
Root cause: Misunderstanding that ISR is dynamic and can shrink under load or network issues.
#2 Enabling unclean leader election and ignoring ISR implications.
Wrong approach: Setting unclean.leader.election.enable=true to improve availability without considering data loss.
Correct approach: Keep unclean.leader.election.enable=false to ensure only ISR replicas become leaders, preserving data safety.
Root cause: Lack of awareness that unclean leader election can cause data loss by electing out-of-sync replicas.
#3 Ignoring ISR changes during partition reassignment.
Wrong approach: Reassigning partitions without monitoring ISR, causing ISR to shrink and availability to drop.
Correct approach: Perform reassignment gradually and monitor ISR to ensure replicas catch up before leader switches.
Root cause: Not understanding that new replicas start empty and must rejoin ISR to maintain fault tolerance.
Key Takeaways
In-sync replicas (ISR) are the set of Kafka replicas fully caught up with the leader, ensuring data safety.
By default, only replicas in the ISR can become leaders, preventing data loss during failover.
ISR membership changes dynamically based on replica health and synchronization status.
ISR size affects write durability and latency through producer acknowledgment settings.
Understanding ISR behavior during partition reassignment and failures is critical for reliable Kafka operations.