Log Compaction in Kafka - Time & Space Complexity
Log compaction in Kafka retains only the latest value for each key in a topic. We want to understand how the time to compact grows as the log size increases. Consider the time complexity of this simplified log compaction process:
```java
Map<String, String> latestRecords = new HashMap<>();
for (Record record : log) {
    // A later record with the same key overwrites the earlier value.
    latestRecords.put(record.key(), record.value());
}

List<Record> compactedLog = new ArrayList<>();
for (Map.Entry<String, String> entry : latestRecords.entrySet()) {
    compactedLog.add(new Record(entry.getKey(), entry.getValue()));
}
```
This code keeps only the newest record for each key: it scans the log once, overwriting earlier values in the map as it goes (each HashMap put is O(1) on average), and then rebuilds the compacted log from the map's entries.
Look at what repeats as the log grows.
- Primary operation: scanning every record in the log.
  - How many times: exactly once per record (n iterations).
- Secondary operation: iterating over the unique keys to build the compacted log.
  - How many times: once per unique key, which is at most the number of records.
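The two operations above can be sketched as a small self-contained program. The `Rec` record and sample keys here are hypothetical stand-ins, since the original snippet does not define its `Record` class:

```java
import java.util.*;

public class CompactionDemo {
    // Hypothetical stand-in for the Record class used in the snippet above.
    record Rec(String key, String value) {}

    static List<Rec> compact(List<Rec> log) {
        Map<String, String> latest = new HashMap<>();
        for (Rec r : log) {
            latest.put(r.key(), r.value()); // one O(1)-average put per record
        }
        List<Rec> compacted = new ArrayList<>();
        for (Map.Entry<String, String> e : latest.entrySet()) {
            compacted.add(new Rec(e.getKey(), e.getValue())); // one add per unique key
        }
        return compacted;
    }

    public static void main(String[] args) {
        List<Rec> log = List.of(
            new Rec("user1", "a"), new Rec("user2", "b"), new Rec("user1", "c"));
        // Only the latest value per key survives: user1 -> c, user2 -> b.
        System.out.println(compact(log));
    }
}
```

Note that a plain `HashMap` does not guarantee the order of the compacted output; swapping in a `LinkedHashMap` would preserve first-insertion order if that mattered.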
As the log size grows, the number of operations grows roughly the same way.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 20 (10 to scan + up to 10 to rebuild) |
| 100 | About 200 (100 to scan + up to 100 to rebuild) |
| 1000 | About 2000 (1000 to scan + up to 1000 to rebuild) |
Pattern observation: The total work grows roughly in direct proportion to the log size.
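The table's numbers can be reproduced by counting operations directly. This is a hypothetical counting sketch (the `opsToCompact` helper is ours, not part of the original snippet): one map put per record, plus one list add per unique key.

```java
import java.util.*;

public class OpCounter {
    // Counts the basic operations performed by the compaction process:
    // one put per record scanned, one add per unique key rebuilt.
    static long opsToCompact(List<String[]> log) {
        long ops = 0;
        Map<String, String> latest = new HashMap<>();
        for (String[] kv : log) {
            latest.put(kv[0], kv[1]);
            ops++;
        }
        List<String[]> compacted = new ArrayList<>();
        for (Map.Entry<String, String> e : latest.entrySet()) {
            compacted.add(new String[]{e.getKey(), e.getValue()});
            ops++;
        }
        return ops;
    }

    public static void main(String[] args) {
        for (int n : new int[]{10, 100, 1000}) {
            List<String[]> log = new ArrayList<>();
            for (int i = 0; i < n; i++) log.add(new String[]{"key" + i, "v"});
            // With all-unique keys this prints 20, 200, and 2000 ops,
            // matching the table's worst case.
            System.out.println(n + " records -> " + opsToCompact(log) + " ops");
        }
    }
}
```

When keys repeat, the rebuild loop shrinks (fewer unique keys), so the table's "up to" figures are a worst case; the scan is always exactly n.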
Time Complexity: O(n)
Space Complexity: O(k), where k is the number of unique keys (at most n), for the map and the compacted log.
This means the time to compact grows linearly with the number of records in the log.
[X] Wrong: "Log compaction takes longer than linear time because it compares every record to every other record."
[OK] Correct: The process only scans the log once and uses a map to keep track of the latest record per key, so it does not compare records pairwise.
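For contrast, here is a sketch of the pairwise approach the misconception describes. This is purely illustrative and is NOT what the map-based compaction does: it compares each record against every later record, which is O(n²).

```java
import java.util.*;

public class NaiveCompaction {
    // Hypothetical stand-in for a (key, value) log record.
    record Rec(String key, String value) {}

    // O(n^2): keeps a record only if no later record has the same key.
    static List<Rec> compactQuadratic(List<Rec> log) {
        List<Rec> compacted = new ArrayList<>();
        for (int i = 0; i < log.size(); i++) {
            boolean supersededLater = false;
            for (int j = i + 1; j < log.size(); j++) { // the pairwise comparison
                if (log.get(j).key().equals(log.get(i).key())) {
                    supersededLater = true; // a newer record for this key exists
                    break;
                }
            }
            if (!supersededLater) compacted.add(log.get(i));
        }
        return compacted;
    }
}
```

The map replaces all of those inner comparisons with a single O(1)-average put, which is exactly why the real process stays linear.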
Understanding how log compaction scales helps you explain efficient data processing in distributed systems.
What if the log contained many duplicate keys clustered together? How would that affect the time complexity?