
Log compaction in Kafka - Deep Dive

Overview - Log compaction
What is it?
Log compaction is a feature in Apache Kafka that keeps the latest value for each unique key in a topic. Instead of deleting old messages based on time or size, it removes older duplicates of the same key, ensuring only the most recent update remains. This helps maintain a compacted log that represents the current state of data. It is useful for topics where the latest state matters more than the full history.
Why it matters
Without log compaction, Kafka topics can grow indefinitely, storing every change ever made, which wastes storage and slows down consumers. Log compaction solves this by keeping only the latest update per key, making data storage efficient and enabling systems to rebuild state quickly. This is crucial for systems like caches, databases, or configurations that need the current snapshot rather than full change history.
Where it fits
Before learning log compaction, you should understand Kafka basics like topics, partitions, producers, and consumers. After mastering log compaction, you can explore Kafka's retention policies, exactly-once semantics, and stateful stream processing with Kafka Streams.
Mental Model
Core Idea
Log compaction keeps only the newest message for each key, removing older duplicates to maintain a current snapshot of data.
Think of it like...
Imagine a whiteboard where you write updates for different tasks. Instead of keeping every old note, you erase previous notes for the same task and keep only the latest one visible. This way, the whiteboard always shows the current status without clutter.
┌───────────────┐
│ Kafka Topic   │
│ (Log Storage) │
└──────┬────────┘
       │
       ▼
┌───────────────────────────────┐
│ Messages with keys and values │
│                               │
│ Key1: Value1 (old)            │
│ Key2: Value2                  │
│ Key1: Value1_updated          │
│ Key3: Value3                  │
└─────────────┬─────────────────┘
              │ Log Compaction
              ▼
┌─────────────────────────────┐
│ Compacted Log (latest keys) │
│                             │
│ Key1: Value1_updated        │
│ Key2: Value2                │
│ Key3: Value3                │
└─────────────────────────────┘
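The before/after diagram above can be sketched in a few lines of Python. This is an illustration of the idea only, not how the broker implements it; the key and value names simply mirror the diagram:

```python
# Illustrative sketch: keep only the newest record per key.
# In Kafka, a surviving record keeps its original offset, so the
# compacted log is ordered by the offsets of the survivors.
def compact(records):
    """records: list of (key, value) pairs in log (offset) order."""
    survivors = {}
    for offset, (key, value) in enumerate(records):
        survivors[key] = (offset, value)  # later writes win
    ordered = sorted(survivors.items(), key=lambda kv: kv[1][0])
    return [(key, value) for key, (offset, value) in ordered]

log = [
    ("Key1", "Value1"),          # superseded below
    ("Key2", "Value2"),
    ("Key1", "Value1_updated"),
    ("Key3", "Value3"),
]
print(compact(log))
# [('Key2', 'Value2'), ('Key1', 'Value1_updated'), ('Key3', 'Value3')]
```

Note that the surviving Key1 record sits at its later position, not where the original Key1 record was: compaction removes superseded records but never moves the ones it keeps.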
Build-Up - 7 Steps
1
Foundation: Understanding Kafka Topics and Keys
Concept: Learn what Kafka topics are and how messages have keys and values.
Kafka stores messages in topics. Each message has a key and a value. The key groups related messages together. For example, a key could be a user ID, and the value could be that user's data update.
Result
You know that Kafka messages are organized by keys inside topics.
Understanding keys is essential because log compaction works by keeping the latest message per key.
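To make "the key groups related messages" concrete: the producer hashes a record's key to pick a partition, so all updates for one key land in the same partition, in order. The sketch below illustrates that idea only; Kafka's default partitioner actually uses murmur2 hashing, and md5 here is just a stand-in:

```python
import hashlib

# Stand-in for Kafka's key-based partitioner: same key -> same partition.
# (Kafka uses murmur2; md5 is only for illustration.)
def partition_for(key: str, num_partitions: int = 3) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All updates for the key "user-42" map to one partition,
# so they stay together and in order there.
assert partition_for("user-42") == partition_for("user-42")
```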
2
Foundation: Basics of Kafka Retention Policies
Concept: Learn how Kafka normally deletes old messages based on time or size.
Kafka topics have retention policies that delete messages after a set time or when the log size grows too large. This keeps storage manageable but loses old data permanently.
Result
You understand that Kafka deletes old messages by default, which may not suit all use cases.
Knowing retention policies helps you see why log compaction is needed for use cases requiring the latest state.
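For reference, the default deletion behavior described above is controlled by topic-level settings like these. The config names are real Kafka settings; the values are illustrative, not recommendations:

```python
# Default-style retention: delete whole old segments by age or size.
retention_config = {
    "cleanup.policy": "delete",                    # delete, not compact
    "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # keep data for ~7 days
    "retention.bytes": "-1",                       # -1 = no size cap
}
```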
3
Intermediate: What Log Compaction Does Differently
🤔 Before reading on: do you think log compaction deletes messages by age or by key? Commit to your answer.
Concept: Log compaction deletes old messages based on keys, not time or size.
Instead of deleting messages by age, log compaction scans the log and removes older messages with the same key, keeping only the newest one. This means the topic always has the latest state per key.
Result
The topic retains a compacted log with one message per key, representing the current state.
Understanding that compaction works by key, not time, explains why it is useful for stateful data.
4
Intermediate: How Kafka Marks Messages for Compaction
🤔 Before reading on: do you think Kafka immediately deletes old messages during compaction or marks them first? Commit to your answer.
Concept: Kafka identifies records eligible for removal during compaction but physically deletes them asynchronously.
Kafka's background log cleaner compacts logs without blocking producers or consumers: it copies the newest record for each key into a new segment and swaps that segment in. Superseded records physically disappear only when the old segments are deleted, which happens later.
Result
Compaction runs without stopping Kafka operations, ensuring smooth performance.
Knowing compaction is asynchronous helps understand why old messages may still appear briefly after compaction starts.
5
Intermediate: Configuring Log Compaction in Kafka
Concept: Learn how to enable and tune log compaction on Kafka topics.
To enable log compaction, set the topic configuration 'cleanup.policy' to 'compact'. You can also combine both policies with 'compact,delete' to apply compaction and time- or size-based deletion together. Other settings, such as 'min.cleanable.dirty.ratio' and 'min.compaction.lag.ms', control how often and how aggressively compaction runs.
Result
You can create topics that keep only the latest messages per key automatically.
Knowing configuration options lets you tailor compaction to your system's needs.
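As a sketch, a compacted topic's configuration might look like this. The config names are real Kafka topic-level settings; the values are arbitrary examples, not recommendations:

```python
# Topic settings for a compacted topic (example values only).
compacted_topic_config = {
    "cleanup.policy": "compact",          # or "compact,delete" for both
    "min.cleanable.dirty.ratio": "0.5",   # compact once >=50% of the log is uncompacted
    "min.compaction.lag.ms": "60000",     # records younger than 60 s are not compacted
    "delete.retention.ms": "86400000",    # keep delete markers (tombstones) for 1 day
    "segment.ms": "604800000",            # roll segments weekly; the active segment is never compacted
}
```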
6
Advanced: Compaction and Message Ordering Guarantees
🤔 Before reading on: does log compaction guarantee message order for all keys or only per key? Commit to your answer.
Concept: Log compaction preserves order per key but not across different keys.
Kafka guarantees message order within a partition. Compaction keeps the latest message per key but does not reorder messages across keys. Consumers must handle this when rebuilding state.
Result
You understand how compaction affects message order and state reconstruction.
Knowing order guarantees prevents bugs when consuming compacted topics.
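A small sketch makes the guarantee concrete. In this simplified model (hypothetical records, not broker code), each surviving record keeps its original offset, so per-key order holds, but the interleaving of different keys can change relative to what was produced:

```python
# After compaction only the last record per key survives, and each
# survivor keeps its original offset. Order per key is preserved,
# but the spacing between different keys' records changes.
def compact_with_offsets(records):
    survivors = {}
    for offset, (key, value) in enumerate(records):
        survivors[key] = (offset, value)
    return sorted((off, key, val) for key, (off, val) in survivors.items())

log = [("a", 1), ("b", 1), ("a", 2), ("b", 2), ("a", 3)]
print(compact_with_offsets(log))
# [(3, 'b', 2), (4, 'a', 3)]
```

Key "a" was produced first, yet after compaction "b"'s surviving record comes first, because it sits at an earlier offset than "a"'s latest update. Consumers rebuilding state must not assume any cross-key ordering.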
7
Expert: Internal Mechanics and Performance Trade-offs
🤔 Before reading on: do you think compaction runs continuously or in batches? Commit to your answer.
Concept: Compaction runs in batches asynchronously to balance performance and storage savings.
Kafka runs compaction in background threads, processing segments of the log at a time. This avoids blocking producers and consumers but means compaction is eventually consistent. Tuning compaction parameters affects CPU, disk I/O, and latency trade-offs.
Result
You can optimize compaction for your workload by adjusting parameters.
Understanding compaction internals helps prevent performance issues in production.
Under the Hood
Kafka stores messages in log segments on disk. The compaction process scans these segments, identifies the latest message for each key, and creates a new compacted segment with only those messages. Old segments are deleted after compaction. This process runs asynchronously and incrementally to avoid impacting normal Kafka operations.
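A heavily simplified model of this segment-level process might look like the following. Real cleaning is more involved (offsets are preserved, multiple cleaned segments can be produced, and cleaning is incremental), but the shape is the same: old segments in, one compacted segment out, active segment untouched:

```python
# Simplified model: collapse all non-active segments into one new
# segment holding only the latest value per key; the active (last)
# segment is never compacted.
def compact_segments(segments):
    *old, active = segments
    latest = {}
    for segment in old:
        for key, value in segment:
            latest[key] = value  # records in later segments override earlier ones
    return [list(latest.items()), active]

segments = [
    [("k1", "v1"), ("k2", "v2")],  # oldest segment
    [("k1", "v1b")],               # newer segment: k1 updated
    [("k3", "v3")],                # active segment, left as-is
]
print(compact_segments(segments))
# [[('k1', 'v1b'), ('k2', 'v2')], [('k3', 'v3')]]
```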
Why designed this way?
Kafka's design balances durability, performance, and storage efficiency. Compaction was introduced to support use cases needing the latest state without losing Kafka's high throughput. Alternatives like immediate deletion or synchronous compaction would block producers or consumers, reducing performance.
┌───────────────┐
│ Log Segments  │
│ ┌───────────┐ │
│ │ Segment 1 │ │
│ │ Segment 2 │ │
│ │ Segment 3 │ │
│ └───────────┘ │
└──────┬────────┘
       │
       ▼
┌────────────────────────────────┐
│ Compaction Process             │
│ - Reads segments               │
│ - Keeps latest per key         │
│ - Writes new compacted segment │
└─────────────┬──────────────────┘
              │
              ▼
┌───────────────┐
│ New Segments  │
│ (Compacted)   │
└───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does log compaction delete messages immediately after a new message with the same key arrives? Commit yes or no.
Common Belief: Log compaction deletes old messages instantly when a new message with the same key is written.
Reality: Compaction runs asynchronously in the background and does not delete old messages immediately.
Why it matters: Expecting immediate deletion can cause confusion when old messages appear temporarily, leading to incorrect assumptions about data freshness.
Quick: Does log compaction remove messages without keys? Commit yes or no.
Common Belief: Log compaction removes all messages, including those without keys.
Reality: Compaction works only on keyed messages; in fact, Kafka rejects records without keys on compacted topics, returning an error to the producer.
Why it matters: Misunderstanding this can cause produce failures or unexpected retention behavior when messages lack keys.
Quick: Does log compaction guarantee global ordering of messages across keys? Commit yes or no.
Common Belief: Log compaction guarantees message order across all keys in a topic.
Reality: Compaction preserves order only per key within partitions, not across different keys.
Why it matters: Assuming global order can cause bugs in state reconstruction or processing logic.
Quick: Can log compaction replace all retention policies? Commit yes or no.
Common Belief: Log compaction alone is enough to manage all data retention needs in Kafka.
Reality: Log compaction is designed for stateful data; time- or size-based retention is still needed for other use cases.
Why it matters: Relying solely on compaction can cause unbounded log growth or data loss in some scenarios.
Expert Zone
1
Compaction frequency and segment size tuning greatly affect Kafka cluster performance and storage efficiency.
2
Combining 'compact' and 'delete' cleanup policies allows flexible retention strategies balancing state snapshot and history.
3
Compaction can cause message duplication during recovery, so consumers must be idempotent or handle duplicates gracefully.
When NOT to use
Avoid log compaction for topics where full event history is critical, such as audit logs or event sourcing. Use time-based retention or external storage for immutable logs instead.
Production Patterns
In production, compacted topics are used for changelog streams in Kafka Streams, configuration topics, and caches. Operators monitor compaction lag and tune segment sizes to optimize throughput and storage.
Connections
Database Indexing
Similar pattern of keeping latest state for quick lookup
Understanding log compaction helps grasp how databases maintain indexes by storing only the latest record versions for fast access.
Garbage Collection in Programming Languages
Both remove outdated or unused data to free resources
Knowing how compaction cleans old messages is like understanding how garbage collectors reclaim memory, improving system efficiency.
Cache Invalidation
Both ensure the system holds the most recent valid data
Log compaction's role in keeping latest messages parallels cache invalidation strategies that keep caches fresh and consistent.
Common Pitfalls
#1 Expecting log compaction to delete old messages immediately.
Wrong approach: Producing a new message with the same key and assuming the old message disappears instantly.
Correct approach: Understand that compaction runs asynchronously and old messages may remain visible until compaction completes.
Root cause: Misunderstanding Kafka's asynchronous compaction process and timing.
#2 Using log compaction on topics without keys.
Wrong approach: Setting 'cleanup.policy=compact' on a topic where messages have no keys.
Correct approach: Ensure every message has a key before enabling compaction; Kafka rejects keyless records on compacted topics, so use time- or size-based retention instead for keyless data.
Root cause: Not realizing compaction requires keys to identify which records supersede each other.
#3 Assuming compaction preserves global message order.
Wrong approach: Writing consumer logic that depends on global ordering of compacted messages across keys.
Correct approach: Design consumers to handle ordering per key and tolerate out-of-order messages across keys.
Root cause: Confusing partition-level ordering guarantees with global ordering.
Key Takeaways
Log compaction keeps only the latest message per key, enabling efficient storage of current state in Kafka topics.
It runs asynchronously in the background, so old messages may remain visible temporarily after updates.
Compaction requires messages to have keys; messages without keys are not compacted.
Compaction preserves message order per key within partitions but not across different keys.
Proper configuration and tuning of compaction parameters are essential for balancing performance and storage in production.