
Log compaction in Kafka - Deep Dive

Overview - Log compaction
What is it?
Log compaction is a feature in Apache Kafka that keeps the latest value for each unique key in a topic. Instead of deleting old messages based on time or size, it removes older duplicates of the same key, ensuring only the most recent update remains. This helps maintain a compacted log that represents the current state of data. It is useful for topics where the latest state matters more than the full history.
Why it matters
Without log compaction, Kafka topics can grow indefinitely, storing every change ever made, which wastes storage and slows down consumers. Log compaction solves this by keeping only the latest update per key, making data storage efficient and enabling systems to rebuild state quickly. This is crucial for systems like caches, databases, or configurations that need the current snapshot rather than full change history.
Where it fits
Before learning log compaction, you should understand Kafka basics like topics, partitions, producers, and consumers. After mastering log compaction, you can explore Kafka's retention policies, exactly-once semantics, and stateful stream processing with Kafka Streams.
Mental Model
Core Idea
Log compaction keeps only the newest message for each key, removing older duplicates to maintain a current snapshot of data.
Think of it like...
Imagine a whiteboard where you write updates for different tasks. Instead of keeping every old note, you erase previous notes for the same task and keep only the latest one visible. This way, the whiteboard always shows the current status without clutter.
┌───────────────┐
│ Kafka Topic   │
│ (Log Storage) │
└──────┬────────┘
       │
       ▼
┌───────────────────────────────┐
│ Messages with keys and values │
│                               │
│ Key1: Value1 (old)            │
│ Key2: Value2                  │
│ Key1: Value1_updated          │
│ Key3: Value3                  │
└─────────────┬─────────────────┘
              │ Log Compaction
              ▼
┌─────────────────────────────┐
│ Compacted Log (latest keys) │
│                             │
│ Key1: Value1_updated        │
│ Key2: Value2                │
│ Key3: Value3                │
└─────────────────────────────┘
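The before/after diagram above can be sketched in a few lines of Python. This is an illustration of the idea only, not how the broker implements it; the key and value names simply mirror the diagram:

```python
# Illustrative sketch: keep only the newest record per key.
# In Kafka, a surviving record keeps its original offset, so the
# compacted log is ordered by the offsets of the survivors.
def compact(records):
    """records: list of (key, value) pairs in log (offset) order."""
    survivors = {}
    for offset, (key, value) in enumerate(records):
        survivors[key] = (offset, value)  # later writes win
    ordered = sorted(survivors.items(), key=lambda kv: kv[1][0])
    return [(key, value) for key, (offset, value) in ordered]

log = [
    ("Key1", "Value1"),          # superseded below
    ("Key2", "Value2"),
    ("Key1", "Value1_updated"),
    ("Key3", "Value3"),
]
print(compact(log))
# [('Key2', 'Value2'), ('Key1', 'Value1_updated'), ('Key3', 'Value3')]
```

Note that the surviving Key1 record sits at its later position, not where the original Key1 record was: compaction removes superseded records but never moves the ones it keeps.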
Build-Up - 7 Steps
1
Foundation: Understanding Kafka Topics and Keys
Concept: Learn what Kafka topics are and how messages have keys and values.
Kafka stores messages in topics. Each message has a key and a value. The key groups related messages together. For example, a key could be a user ID, and the value could be that user's data update.
Result
You know that Kafka messages are organized by keys inside topics.
Understanding keys is essential because log compaction works by keeping the latest message per key.
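To make "the key groups related messages" concrete: the producer hashes a record's key to pick a partition, so all updates for one key land in the same partition, in order. The sketch below illustrates that idea only; Kafka's default partitioner actually uses murmur2 hashing, and md5 here is just a stand-in:

```python
import hashlib

# Stand-in for Kafka's key-based partitioner: same key -> same partition.
# (Kafka uses murmur2; md5 is only for illustration.)
def partition_for(key: str, num_partitions: int = 3) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All updates for the key "user-42" map to one partition,
# so they stay together and in order there.
assert partition_for("user-42") == partition_for("user-42")
```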
2
Foundation: Basics of Kafka Retention Policies
Concept: Learn how Kafka normally deletes old messages based on time or size.
Kafka topics have retention policies that delete messages after a set time or when the log size grows too large. This keeps storage manageable but loses old data permanently.
Result
You understand that Kafka deletes old messages by default, which may not suit all use cases.
Knowing retention policies helps you see why log compaction is needed for use cases requiring the latest state.
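For reference, the default deletion behavior described above is controlled by topic-level settings like these. The config names are real Kafka settings; the values are illustrative, not recommendations:

```python
# Default-style retention: delete whole old segments by age or size.
retention_config = {
    "cleanup.policy": "delete",                    # delete, not compact
    "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # keep data for ~7 days
    "retention.bytes": "-1",                       # -1 = no size cap
}
```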
3
Intermediate: What Log Compaction Does Differently
🤔 Before reading on: do you think log compaction deletes messages by age or by key? Commit to your answer.
Concept: Log compaction deletes old messages based on keys, not time or size.
Instead of deleting messages by age, log compaction scans the log and removes older messages with the same key, keeping only the newest one. This means the topic always has the latest state per key.
Result
The topic retains a compacted log with one message per key, representing the current state.
Understanding that compaction works by key, not time, explains why it is useful for stateful data.
4
Intermediate: How Kafka Marks Messages for Compaction
🤔 Before reading on: do you think Kafka immediately deletes old messages during compaction or marks them first? Commit to your answer.
Concept: Kafka identifies records eligible for removal during compaction but physically deletes them asynchronously.
Kafka's background log cleaner compacts logs without blocking producers or consumers: it copies the newest record for each key into a new segment and swaps that segment in. Superseded records physically disappear only when the old segments are deleted, which happens later.
Result
Compaction runs without stopping Kafka operations, ensuring smooth performance.
Knowing compaction is asynchronous helps understand why old messages may still appear briefly after compaction starts.
5
Intermediate: Configuring Log Compaction in Kafka
Concept: Learn how to enable and tune log compaction on Kafka topics.
To enable log compaction, set the topic configuration 'cleanup.policy' to 'compact'. You can also combine both policies with 'compact,delete' to apply compaction and time- or size-based deletion together. Other settings, such as 'min.cleanable.dirty.ratio' and 'min.compaction.lag.ms', control how often and how aggressively compaction runs.
Result
You can create topics that keep only the latest messages per key automatically.
Knowing configuration options lets you tailor compaction to your system's needs.
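As a sketch, a compacted topic's configuration might look like this. The config names are real Kafka topic-level settings; the values are arbitrary examples, not recommendations:

```python
# Topic settings for a compacted topic (example values only).
compacted_topic_config = {
    "cleanup.policy": "compact",          # or "compact,delete" for both
    "min.cleanable.dirty.ratio": "0.5",   # compact once >=50% of the log is uncompacted
    "min.compaction.lag.ms": "60000",     # records younger than 60 s are not compacted
    "delete.retention.ms": "86400000",    # keep delete markers (tombstones) for 1 day
    "segment.ms": "604800000",            # roll segments weekly; the active segment is never compacted
}
```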
6
Advanced: Compaction and Message Ordering Guarantees
🤔 Before reading on: does log compaction guarantee message order for all keys or only per key? Commit to your answer.
Concept: Log compaction preserves order per key but not across different keys.
Kafka guarantees message order within a partition. Compaction keeps the latest message per key but does not reorder messages across keys. Consumers must handle this when rebuilding state.
Result
You understand how compaction affects message order and state reconstruction.
Knowing order guarantees prevents bugs when consuming compacted topics.
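A small sketch makes the guarantee concrete. In this simplified model (hypothetical records, not broker code), each surviving record keeps its original offset, so per-key order holds, but the interleaving of different keys can change relative to what was produced:

```python
# After compaction only the last record per key survives, and each
# survivor keeps its original offset. Order per key is preserved,
# but the spacing between different keys' records changes.
def compact_with_offsets(records):
    survivors = {}
    for offset, (key, value) in enumerate(records):
        survivors[key] = (offset, value)
    return sorted((off, key, val) for key, (off, val) in survivors.items())

log = [("a", 1), ("b", 1), ("a", 2), ("b", 2), ("a", 3)]
print(compact_with_offsets(log))
# [(3, 'b', 2), (4, 'a', 3)]
```

Key "a" was produced first, yet after compaction "b"'s surviving record comes first, because it sits at an earlier offset than "a"'s latest update. Consumers rebuilding state must not assume any cross-key ordering.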
7
Expert: Internal Mechanics and Performance Trade-offs
🤔 Before reading on: do you think compaction runs continuously or in batches? Commit to your answer.
Concept: Compaction runs in batches asynchronously to balance performance and storage savings.
Kafka runs compaction in background threads, processing segments of the log at a time. This avoids blocking producers and consumers but means compaction is eventually consistent. Tuning compaction parameters affects CPU, disk I/O, and latency trade-offs.
Result
You can optimize compaction for your workload by adjusting parameters.
Understanding compaction internals helps prevent performance issues in production.
Under the Hood
Kafka stores messages in log segments on disk. The compaction process scans these segments, identifies the latest message for each key, and creates a new compacted segment with only those messages. Old segments are deleted after compaction. This process runs asynchronously and incrementally to avoid impacting normal Kafka operations.
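A heavily simplified model of this segment-level process might look like the following. Real cleaning is more involved (offsets are preserved, multiple cleaned segments can be produced, and cleaning is incremental), but the shape is the same: old segments in, one compacted segment out, active segment untouched:

```python
# Simplified model: collapse all non-active segments into one new
# segment holding only the latest value per key; the active (last)
# segment is never compacted.
def compact_segments(segments):
    *old, active = segments
    latest = {}
    for segment in old:
        for key, value in segment:
            latest[key] = value  # records in later segments override earlier ones
    return [list(latest.items()), active]

segments = [
    [("k1", "v1"), ("k2", "v2")],  # oldest segment
    [("k1", "v1b")],               # newer segment: k1 updated
    [("k3", "v3")],                # active segment, left as-is
]
print(compact_segments(segments))
# [[('k1', 'v1b'), ('k2', 'v2')], [('k3', 'v3')]]
```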
Why designed this way?
Kafka's design balances durability, performance, and storage efficiency. Compaction was introduced to support use cases needing the latest state without losing Kafka's high throughput. Alternatives like immediate deletion or synchronous compaction would block producers or consumers, reducing performance.
┌───────────────┐
│ Log Segments  │
│ ┌───────────┐ │
│ │ Segment 1 │ │
│ │ Segment 2 │ │
│ │ Segment 3 │ │
│ └───────────┘ │
└──────┬────────┘
       │
       ▼
┌────────────────────────────────┐
│ Compaction Process             │
│ - Reads segments               │
│ - Keeps latest per key         │
│ - Writes new compacted segment │
└─────────────┬──────────────────┘
              │
              ▼
┌───────────────┐
│ New Segments  │
│ (Compacted)   │
└───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does log compaction delete messages immediately after a new message with the same key arrives? Commit yes or no.
Common Belief: Log compaction deletes old messages instantly when a new message with the same key is written.
Reality: Compaction runs asynchronously in the background and does not delete old messages immediately.
Why it matters: Expecting immediate deletion can cause confusion when old messages appear temporarily, leading to incorrect assumptions about data freshness.
Quick: Does log compaction remove messages without keys? Commit yes or no.
Common Belief: Log compaction removes all messages, including those without keys.
Reality: Compaction works only on keyed messages; in fact, Kafka rejects records without keys on compacted topics, returning an error to the producer.
Why it matters: Misunderstanding this can cause produce failures or unexpected retention behavior when messages lack keys.
Quick: Does log compaction guarantee global ordering of messages across keys? Commit yes or no.
Common Belief: Log compaction guarantees message order across all keys in a topic.
Reality: Compaction preserves order only per key within partitions, not across different keys.
Why it matters: Assuming global order can cause bugs in state reconstruction or processing logic.
Quick: Can log compaction replace all retention policies? Commit yes or no.
Common Belief: Log compaction alone is enough to manage all data retention needs in Kafka.
Reality: Log compaction is designed for stateful data; time- or size-based retention is still needed for other use cases.
Why it matters: Relying solely on compaction can cause unbounded log growth or data loss in some scenarios.
Expert Zone
1
Compaction frequency and segment size tuning greatly affect Kafka cluster performance and storage efficiency.
2
Combining 'compact' and 'delete' cleanup policies allows flexible retention strategies balancing state snapshot and history.
3
Compaction can cause message duplication during recovery, so consumers must be idempotent or handle duplicates gracefully.
When NOT to use
Avoid log compaction for topics where full event history is critical, such as audit logs or event sourcing. Use time-based retention or external storage for immutable logs instead.
Production Patterns
In production, compacted topics are used for changelog streams in Kafka Streams, configuration topics, and caches. Operators monitor compaction lag and tune segment sizes to optimize throughput and storage.
Connections
Database Indexing
Similar pattern of keeping latest state for quick lookup
Understanding log compaction helps grasp how databases maintain indexes by storing only the latest record versions for fast access.
Garbage Collection in Programming Languages
Both remove outdated or unused data to free resources
Knowing how compaction cleans old messages is like understanding how garbage collectors reclaim memory, improving system efficiency.
Cache Invalidation
Both ensure the system holds the most recent valid data
Log compaction's role in keeping latest messages parallels cache invalidation strategies that keep caches fresh and consistent.
Common Pitfalls
#1 Expecting log compaction to delete old messages immediately.
Wrong approach: Producing a new message with the same key and assuming the old message disappears instantly.
Correct approach: Understand that compaction runs asynchronously and old messages may remain visible until compaction completes.
Root cause: Misunderstanding Kafka's asynchronous compaction process and timing.
#2 Using log compaction on topics without keys.
Wrong approach: Setting 'cleanup.policy=compact' on a topic where messages have no keys.
Correct approach: Ensure every message has a key before enabling compaction; Kafka rejects keyless records on compacted topics, so use time- or size-based retention instead for keyless data.
Root cause: Not realizing compaction requires keys to identify which records supersede each other.
#3 Assuming compaction preserves global message order.
Wrong approach: Writing consumer logic that depends on global ordering of compacted messages across keys.
Correct approach: Design consumers to handle ordering per key and tolerate out-of-order messages across keys.
Root cause: Confusing partition-level ordering guarantees with global ordering.
Key Takeaways
Log compaction keeps only the latest message per key, enabling efficient storage of current state in Kafka topics.
It runs asynchronously in the background, so old messages may remain visible temporarily after updates.
Compaction requires messages to have keys; messages without keys are not compacted.
Compaction preserves message order per key within partitions but not across different keys.
Proper configuration and tuning of compaction parameters are essential for balancing performance and storage in production.