Kafka · DevOps · ~5 min read

Log compaction in Kafka - Time & Space Complexity

Time Complexity: Log compaction
O(n)
Understanding Time Complexity

Log compaction in Kafka helps keep only the latest value for each key in a topic.

We want to understand how the time to compact logs grows as the log size increases.

Scenario Under Consideration

Analyze the time complexity of this simplified log compaction process.


    // One pass over the log: put() overwrites any earlier value,
    // so only the latest value per key survives.
    Map<String, String> latestRecords = new HashMap<>();
    for (Record record : log) {
        latestRecords.put(record.key(), record.value());
    }
    // Rebuild the compacted log, one record per surviving key.
    List<Record> compactedLog = new ArrayList<>();
    for (Map.Entry<String, String> entry : latestRecords.entrySet()) {
        compactedLog.add(new Record(entry.getKey(), entry.getValue()));
    }
    

This code keeps only the newest record for each key by scanning the log once and then rebuilding the compacted log.
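The idea above can be packaged as a small runnable sketch. The `Record` type here is a hypothetical stand-in, and `compact` is a name chosen for illustration; Kafka's real compaction runs on its internal segment format and is far more involved.

```java
import java.util.*;

public class CompactionSketch {
    // Hypothetical stand-in for a Kafka record: just a key-value pair.
    record Record(String key, String value) {}

    static List<Record> compact(List<Record> log) {
        // LinkedHashMap preserves first-seen key order; put() overwrites,
        // so only the latest value per key survives the single scan.
        Map<String, String> latest = new LinkedHashMap<>();
        for (Record r : log) {
            latest.put(r.key(), r.value());
        }
        // Rebuild the compacted log, one record per surviving key.
        List<Record> compacted = new ArrayList<>();
        for (Map.Entry<String, String> e : latest.entrySet()) {
            compacted.add(new Record(e.getKey(), e.getValue()));
        }
        return compacted;
    }

    public static void main(String[] args) {
        List<Record> log = List.of(
            new Record("user1", "v1"),
            new Record("user2", "v1"),
            new Record("user1", "v2"));
        // user1 keeps only its newest value, v2
        System.out.println(compact(log));
    }
}
```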

Identify Repeating Operations

Look at what repeats as the log grows.

  • Primary operation: Looping through all records in the log once.
  • How many times: Exactly once for each record in the log.
  • Secondary operation: Looping through unique keys to build the compacted log.
  • How many times: Once for each unique key, which is at most the number of records.
How Execution Grows With Input

As the log size grows, the number of operations grows roughly the same way.

Input Size (n) | Approx. Operations
10             | About 20 (10 to scan + up to 10 to rebuild)
100            | About 200 (100 to scan + up to 100 to rebuild)
1000           | About 2000 (1000 to scan + up to 1000 to rebuild)

Pattern observation: The total work grows roughly in direct proportion to the log size.
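A sketch of the counting behind the table, assuming every key is unique (the worst case for the rebuild pass). `countOps` is a name invented for this illustration; with n unique keys the total is exactly 2n, matching the rows above.

```java
import java.util.*;

public class CompactionCost {
    // Count map updates (scan) plus entry visits (rebuild)
    // for a log of n records with all-unique keys.
    static long countOps(int n) {
        Map<Integer, Integer> latest = new HashMap<>();
        long ops = 0;
        for (int i = 0; i < n; i++) {   // scan: n put() calls
            latest.put(i, i);
            ops++;
        }
        for (Map.Entry<Integer, Integer> e : latest.entrySet()) {
            ops++;                      // rebuild: up to n entry visits
        }
        return ops;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000}) {
            // prints 20, 200, 2000: linear growth
            System.out.println(n + " records -> " + countOps(n) + " ops");
        }
    }
}
```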

Final Time Complexity

Time Complexity: O(n)

This means the time to compact grows linearly with the number of records in the log.

Common Mistake

[X] Wrong: "Log compaction takes longer than linear time because it compares every record to every other record."

[OK] Correct: The process only scans the log once and uses a map to keep track of the latest record per key, so it does not compare records pairwise.
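The contrast can be made concrete by counting. The class and method names below are invented for illustration: the mistaken mental model implies roughly n²/2 pairwise comparisons, while the map-based scan does one amortized O(1) update per record.

```java
import java.util.*;

public class PairwiseVsMap {
    // Hypothetical stand-in for a Kafka record.
    record Rec(String key, String value) {}

    // Mistaken mental model: compare every record to every other record.
    static long pairwiseComparisons(List<Rec> log) {
        long comparisons = 0;
        for (int i = 0; i < log.size(); i++) {
            for (int j = i + 1; j < log.size(); j++) {
                comparisons++;  // one comparison per pair: n(n-1)/2 total
            }
        }
        return comparisons;
    }

    // Actual approach: one map update per record, no pairwise work.
    static long mapUpdates(List<Rec> log) {
        Map<String, String> latest = new HashMap<>();
        long updates = 0;
        for (Rec r : log) {
            latest.put(r.key(), r.value());  // amortized O(1)
            updates++;
        }
        return updates;
    }

    public static void main(String[] args) {
        List<Rec> log = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            log.add(new Rec("key" + (i % 100), "v" + i));
        }
        System.out.println("pairwise comparisons: " + pairwiseComparisons(log)); // ~ n^2 / 2
        System.out.println("map updates: " + mapUpdates(log));                   // exactly n
    }
}
```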

Interview Connect

Understanding how log compaction scales helps you explain efficient data processing in distributed systems.

Self-Check

What if the log contained many duplicate keys clustered together? How would that affect the time complexity?