Kafka · DevOps · ~15 mins

Retention policies (time-based, size-based) in Kafka - Deep Dive

Overview - Retention policies (time-based, size-based)
What is it?
Retention policies in Kafka control how long or how much data is kept in a topic before it is deleted. Time-based retention deletes messages older than a set time, while size-based retention deletes messages when the topic's data exceeds a set size. These policies help manage storage and ensure Kafka does not run out of space. They work automatically in the background without manual intervention.
Why it matters
Without retention policies, Kafka topics could grow endlessly, filling up disk space and causing system failures. Retention policies keep data manageable and predictable, allowing Kafka to run smoothly and reliably. They also help balance between keeping enough data for consumers and freeing up resources. This makes Kafka practical for real-world use where data volume is large and continuous.
Where it fits
Before learning retention policies, you should understand Kafka topics, partitions, and how Kafka stores messages. After mastering retention policies, you can explore Kafka compaction, consumer groups, and data cleanup strategies. Retention policies are part of Kafka's data lifecycle management.
Mental Model
Core Idea
Retention policies automatically remove old or excess data from Kafka topics to keep storage under control and ensure system stability.
Think of it like...
It's like a refrigerator that automatically throws away food after a certain date or when it gets too full, so it never overflows and always has space for fresh items.
┌───────────────────────────────┐
│         Kafka Topic           │
│ ┌───────────────┐             │
│ │   Messages   │             │
│ │ ┌───────────┐ │             │
│ │ │ Time-based│ │             │
│ │ │ Retention │ │             │
│ │ └───────────┘ │             │
│ │ ┌───────────┐ │             │
│ │ │ Size-based│ │             │
│ │ │ Retention │ │             │
│ │ └───────────┘ │             │
│ └───────────────┘             │
└───────────────────────────────┘
Build-Up - 7 Steps
1
FoundationWhat is Kafka Retention Policy
🤔
Concept: Introduction to the idea that Kafka deletes old data automatically based on rules.
Kafka stores messages in topics. Without limits, these topics grow forever. Retention policies tell Kafka when to delete old messages to save space. Two main types exist: time-based and size-based retention.
Result
Learner understands that retention policies prevent infinite data growth in Kafka topics.
Knowing that Kafka manages data lifecycle automatically helps avoid manual cleanup and system crashes.
2
FoundationDifference Between Time and Size Retention
🤔
Concept: Explaining the two main retention types and how they decide what to delete.
Time-based retention deletes messages older than a set time (e.g., 7 days). Size-based retention deletes messages when the total topic size exceeds a limit (e.g., 1 GB). Both work independently and can be combined.
Result
Learner can distinguish between deleting data by age versus by total size.
Understanding these two methods clarifies how Kafka balances data availability and storage limits.
3
IntermediateConfiguring Time-Based Retention in Kafka
🤔Before reading on: do you think setting retention.ms to 86400000 deletes messages older than 1 hour or 1 day? Commit to your answer.
Concept: How to set time-based retention using Kafka configuration properties.
Kafka uses the topic-level property retention.ms to set time-based retention in milliseconds. For example, retention.ms=604800000 means messages older than 7 days are deleted. It can be set per topic (overriding the default) or cluster-wide via the broker defaults log.retention.ms or log.retention.hours.
Result
Learner knows how to configure time-based retention and what the values mean.
Knowing the exact config key and units prevents common mistakes that cause unexpected data loss or retention.
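Because retention.ms takes raw milliseconds, unit slips (the 1-hour vs. 1-day trap above) are easy. A small sketch in Python derives the value instead of hand-typing it; the helper names are my own, not Kafka's:

```python
# Derive retention.ms values from human-readable durations instead of
# typing raw millisecond counts (helper names are illustrative).

MS_PER_HOUR = 60 * 60 * 1000
MS_PER_DAY = 24 * MS_PER_HOUR

def retention_ms(days=0, hours=0):
    """Return a retention.ms value for the given duration."""
    return days * MS_PER_DAY + hours * MS_PER_HOUR

print(retention_ms(days=7))  # 604800000 -- the 7-day example above
print(retention_ms(days=1))  # 86400000  -- 1 day, not 1 hour
```

Generating the number this way makes the intended duration visible in code review instead of hiding it in a nine-digit constant.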
4
IntermediateConfiguring Size-Based Retention in Kafka
🤔Before reading on: does retention.bytes limit the size per partition or the whole topic? Commit to your answer.
Concept: How to set size-based retention using Kafka configuration properties.
Kafka uses the topic-level property retention.bytes to limit the size of data per partition (the broker-wide default is log.retention.bytes). When a partition exceeds this, its oldest segments are deleted. For example, retention.bytes=1073741824 caps each partition at 1 GiB. This helps control disk usage precisely.
Result
Learner understands how to limit topic size and the scope of size limits.
Knowing size limits apply per partition helps design topics with correct partition counts and storage planning.
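Since the limit is per partition, a worst-case disk estimate has to multiply by partition count and replication factor. A rough Python sketch with made-up numbers:

```python
# retention.bytes caps each partition, not the topic, so worst-case
# cluster disk usage scales with partitions and replication factor.
# All numbers below are illustrative, not recommendations.

ONE_GIB = 1024 ** 3  # 1073741824, matching the example above

def worst_case_topic_bytes(retention_bytes, partitions, replication_factor):
    """Upper bound on disk used by one topic across the cluster."""
    return retention_bytes * partitions * replication_factor

# A "1 GB" topic with 12 partitions and replication factor 3:
total = worst_case_topic_bytes(ONE_GIB, partitions=12, replication_factor=3)
print(total // ONE_GIB)  # 36 -- GiB across the cluster, not 1
```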
5
IntermediateHow Kafka Deletes Messages Internally
🤔
Concept: Understanding Kafka's background process that removes expired data.
Kafka runs a background cleaner that checks closed log segments against the retention policies. It deletes a segment (file) only when it is fully expired by time or pushes the partition over the size limit; the active segment currently being written is never deleted. Partially expired segments are kept until every record in them has expired. This process is automatic and transparent to users.
Result
Learner understands the deletion process is segment-based, not message-by-message.
Knowing segment-level deletion explains why retention is approximate and why some old messages may linger briefly.
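The segment-level behavior can be mimicked in a few lines of Python. This is a toy model: the field names and structure are illustrative, not Kafka internals:

```python
# Toy model of time-based cleanup: Kafka deletes whole segment files,
# and only when the NEWEST record in the segment has expired.

def expired_segments(segments, retention_ms, now_ms, active_index):
    """Return indices of closed segments the cleaner would delete."""
    doomed = []
    for i, seg in enumerate(segments):
        if i == active_index:
            continue  # the active (still-written) segment is never deleted
        if now_ms - seg["max_timestamp_ms"] > retention_ms:
            doomed.append(i)
    return doomed

DAY = 86_400_000
now = 100 * DAY
segments = [
    {"max_timestamp_ms": now - 9 * DAY},  # fully expired -> deleted
    {"max_timestamp_ms": now - 6 * DAY},  # newest record still fresh, so
                                          # older records in it linger on
    {"max_timestamp_ms": now},            # active segment
]
print(expired_segments(segments, retention_ms=7 * DAY, now_ms=now,
                       active_index=2))  # [0]
```

The middle segment survives even though some of its records may be older than 7 days, which is exactly why retention timing is approximate.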
6
AdvancedCombining Time and Size Retention Policies
🤔Before reading on: if both time and size limits are set, which one triggers deletion first? Commit to your answer.
Concept: How Kafka applies both retention policies together to decide when to delete data.
Kafka deletes data when either the time limit or size limit is exceeded. This means messages older than retention.ms or when partition size exceeds retention.bytes will be removed. This dual check ensures flexible control over data lifecycle.
Result
Learner understands that retention policies work as OR conditions, not AND.
Knowing this prevents surprises where data is deleted earlier than expected due to size limits.
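The OR semantics can be sketched like this (a toy model with invented sizes and timestamps, not Kafka's actual cleaner code):

```python
# Toy model of the OR semantics: data goes as soon as EITHER the time
# limit or the size limit is breached.

def cleanup(segments, retention_ms, retention_bytes, now_ms):
    """Oldest-first deletion until both limits are satisfied."""
    # Time check: drop segments whose newest record has expired.
    kept = [s for s in segments if now_ms - s["max_ts"] <= retention_ms]
    # Size check: keep dropping the oldest while the partition is too big
    # (real Kafka always retains at least the active segment).
    while len(kept) > 1 and sum(s["bytes"] for s in kept) > retention_bytes:
        kept.pop(0)
    return kept

DAY = 86_400_000
now = 100 * DAY
segments = [
    {"max_ts": now - 2 * DAY, "bytes": 600},
    {"max_ts": now - 1 * DAY, "bytes": 600},
    {"max_ts": now,           "bytes": 300},
]
# Nothing is older than 7 days, yet the 1000-byte cap still deletes data:
kept = cleanup(segments, retention_ms=7 * DAY, retention_bytes=1000,
               now_ms=now)
print(len(kept))  # 2
```

Here the size limit fires well before any record reaches the time limit, illustrating why data can vanish "early" on high-throughput topics.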
7
ExpertRetention Policy Impact on Consumer Behavior
🤔Before reading on: do retention policies affect only storage or also consumer message availability? Commit to your answer.
Concept: How retention policies influence what messages consumers can read and potential data loss scenarios.
Retention policies delete messages permanently, so consumers that fall too far behind may miss data. Retention settings must therefore balance storage limits against consumer needs. Note that compacted topics behave differently: with cleanup.policy=compact, Kafka keeps the latest record per key regardless of retention.ms or retention.bytes.
Result
Learner understands retention affects data availability and consumer design.
Knowing retention impacts consumers helps design systems that avoid data loss and ensure timely processing.
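One hedged way to reason about consumer safety is to require that the retention window comfortably exceeds the longest outage a consumer might experience. An illustrative check (the safety factor is an assumption of mine, not a Kafka rule):

```python
# A consumer offline (or lagging) longer than the retention window
# silently loses records. Crude planning check with invented numbers.

def survives_outage(retention_ms, max_outage_ms, safety_factor=2):
    """True if retention covers the worst-case outage with headroom."""
    return retention_ms >= max_outage_ms * safety_factor

DAY = 86_400_000
print(survives_outage(retention_ms=7 * DAY, max_outage_ms=3 * DAY))  # True
print(survives_outage(retention_ms=1 * DAY, max_outage_ms=3 * DAY))  # False
```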
Under the Hood
Kafka stores each partition's messages in log segments on disk. Each segment is a file with messages ordered by offset. The retention cleaner scans closed segments periodically. If every message in a segment is older than retention.ms (Kafka checks the segment's largest timestamp) or the partition's total size exceeds retention.bytes, the entire segment file is deleted. Partially expired segments are kept until fully expired, and the active segment is never deleted. This design optimizes disk I/O and avoids deleting individual messages, which would be inefficient.
Why designed this way?
Segment-based retention was chosen to optimize performance and reduce overhead. Deleting whole files is faster and simpler than deleting individual messages. Also, Kafka's append-only log design fits well with segment deletion. Alternatives like per-message deletion would slow down writes and complicate storage management. The dual retention policies provide flexible control for different use cases.
┌───────────────┐
│ Kafka Topic   │
│ ┌───────────┐ │
│ │ Partition │ │
│ │ ┌───────┐ │ │
│ │ │Segment│ │ │
│ │ │ Files │ │ │
│ │ └───────┘ │ │
│ └───────────┘ │
└─────┬─────────┘
      │ Cleaner scans segments
      ▼
┌─────────────────────────┐
│ Check segment age & size│
│ If expired, delete file │
└─────────────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does retention.ms delete messages exactly at the time limit or sometime after? Commit to one.
Common Belief:retention.ms deletes messages exactly when they reach the time limit.
Reality:retention.ms triggers deletion only once the entire segment containing those messages has expired and been closed, which can be noticeably later than the exact time limit.
Why it matters:Expecting exact deletion timing can cause confusion when old messages appear longer than expected, leading to wrong assumptions about retention settings.
Quick: Does retention.bytes limit the total topic size or per partition? Commit to one.
Common Belief:retention.bytes limits the total size of the entire topic.
Reality:retention.bytes limits the size per partition, not the whole topic.
Why it matters:Misunderstanding this can cause storage planning errors, especially with many partitions, leading to unexpected disk usage.
Quick: Does retention policy affect compacted topics the same way as normal topics? Commit to yes or no.
Common Belief:Retention policies delete messages in compacted topics just like normal topics.
Reality:Compacted topics (cleanup.policy=compact) keep the latest message per key regardless of retention.ms or retention.bytes; only the compact,delete policy applies both, so retention behaves differently there.
Why it matters:Confusing this can cause data loss or misunderstanding of compacted topic behavior, affecting critical data retention.
Quick: Can retention policies cause consumers to lose messages if they read late? Commit to yes or no.
Common Belief:Retention policies only affect storage and do not impact consumer message availability.
Reality:Retention policies delete messages permanently, so late consumers may miss data if retention expires before they read.
Why it matters:Ignoring this can lead to data loss in consumer applications and unexpected bugs.
Expert Zone
1
Retention policies operate at the partition segment level, so message deletion is approximate and depends on segment boundaries.
2
Setting very low retention.ms only takes effect once segments roll, so it usually must be paired with a lower segment.ms or segment.bytes; the resulting frequent segment rolls and deletions add I/O overhead and can surprise lagging consumers.
3
Topics with cleanup.policy=compact,delete combine time/size retention with key-based compaction, requiring careful tuning to avoid unintended data loss.
When NOT to use
Retention policies are not suitable when you need to keep all data indefinitely or require precise message-level deletion. In such cases, use Kafka log compaction or external storage systems like HDFS or cloud storage for archiving.
Production Patterns
In production, teams set retention.ms to balance data freshness and storage cost, often using size-based retention to prevent disk overflow. They monitor topic sizes and consumer lag to adjust policies dynamically. Compacted topics are used for changelog or state data, while time/size retention is used for event streams.
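For sizing, a common back-of-the-envelope: under time-based retention, steady-state disk usage per partition is roughly the ingest rate multiplied by the retention window. A sketch with illustrative numbers, not recommendations:

```python
# Back-of-the-envelope sizing under time-based retention:
# steady-state bytes per partition ~= ingest rate * retention window.

def steady_state_bytes(bytes_per_sec, retention_ms):
    """Approximate on-disk bytes one partition settles at."""
    return bytes_per_sec * (retention_ms // 1000)

DAY = 86_400_000
# 100 KiB/s into a single partition, kept for 7 days:
est = steady_state_bytes(100 * 1024, 7 * DAY)
print(round(est / 1024 ** 3, 1))  # ~57.7 GiB per partition, per replica
```

Estimates like this show why teams often add retention.bytes as a safety net even when time-based retention is the primary policy.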
Connections
Database Archiving
Both manage data lifecycle by removing old or excess data to save space.
Understanding retention policies in Kafka helps grasp how databases archive or purge old records to maintain performance.
Garbage Collection in Programming
Retention policies are like garbage collectors that remove unused data automatically.
Knowing how garbage collection frees memory clarifies why Kafka deletes whole segments instead of individual messages for efficiency.
Refrigerator Food Management
Both use time and space limits to decide when to discard items.
This cross-domain insight shows how everyday systems balance freshness and capacity, similar to Kafka's data retention.
Common Pitfalls
#1Setting retention.ms too low causing premature data loss.
Wrong approach:retention.ms=60000 # 1 minute retention
Correct approach:retention.ms=604800000 # 7 days retention
Root cause:Misunderstanding the time unit (milliseconds) and setting too small a value leads to losing data consumers haven't processed yet.
#2Assuming retention.bytes limits total topic size instead of per partition.
Wrong approach:retention.bytes=1073741824 # expecting 1GB total topic size
Correct approach:retention.bytes=1073741824 # actually 1GB per partition, adjust partition count accordingly
Root cause:Not knowing retention.bytes applies per partition causes storage planning errors and unexpected disk usage.
#3Expecting retention policies to delete messages immediately at expiration time.
Wrong approach:Believing messages older than retention.ms are deleted instantly.
Correct approach:Understanding that deletion happens when entire segments expire, so some old messages may remain briefly.
Root cause:Ignoring Kafka's segment-based storage model leads to wrong expectations about deletion timing.
Key Takeaways
Kafka retention policies automatically delete old or excess data to control storage and system health.
Time-based retention deletes messages older than a set time, while size-based retention deletes when partition size exceeds a limit.
Retention policies operate at the segment level, so deletion timing is approximate, not exact per message.
retention.bytes limits size per partition, not the whole topic, which affects storage planning.
Retention policies impact data availability for consumers, so settings must balance storage and processing needs.