
Topic configuration in Kafka - Deep Dive

Overview - Topic configuration
What is it?
Topic configuration in Kafka means setting up rules and options that control how a topic behaves. A topic is like a category or feed name where messages are stored and read. Configurations include things like how many copies of data to keep, how long to keep messages, and how big the topic can grow. These settings help Kafka manage data efficiently and reliably.
Why it matters
Without proper topic configuration, Kafka might lose data, run out of space, or deliver messages slowly. Imagine a mailbox that overflows or loses letters because it wasn't set up right. Good configuration ensures data is safe, available, and fast to access, which is critical for apps that rely on real-time data streams.
Where it fits
Before learning topic configuration, you should understand Kafka basics like what topics and partitions are. After mastering configuration, you can explore Kafka cluster tuning, security settings, and advanced features like log compaction and retention policies.
Mental Model
Core Idea
Topic configuration sets the rules that control how Kafka stores, retains, and manages messages in each topic.
Think of it like...
Think of a Kafka topic like a library shelf where books (messages) are stored. Topic configuration decides how many copies of each book to keep, how long to keep them on the shelf, and how big the shelf can be before you need to remove old books.
┌─────────────────────────────┐
│         Kafka Topic         │
├─────────────┬───────────────┤
│ Partitions  │ Configurations│
│ (message    │ ┌───────────┐ │
│  storage)   │ │Retention  │ │
│             │ │Replication│ │
│             │ │Cleanup    │ │
│             │ │Policies   │ │
│             │ └───────────┘ │
└─────────────┴───────────────┘
Build-Up - 7 Steps
1
Foundation: What is a Kafka Topic
🤔
Concept: Introduce the basic concept of a Kafka topic as a message category.
A Kafka topic is a named stream where messages are published and consumed. It acts like a folder or channel for messages. Topics are split into partitions to allow parallel processing and scalability.
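As a concrete sketch, a topic is typically created with the `kafka-topics.sh` CLI that ships with Kafka. This assumes a broker is reachable at `localhost:9092`; the topic name `orders` is illustrative.

```shell
# Create a topic named "orders" with 3 partitions and 2 replicas
# (assumes a broker at localhost:9092; names are illustrative).
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic orders \
  --partitions 3 \
  --replication-factor 2
```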
Result
You understand that topics organize messages and are the main unit of data in Kafka.
Knowing what a topic is helps you see why configuring it properly affects how data flows and is stored.
2
Foundation: Basic Topic Configuration Options
🤔
Concept: Learn the simplest settings that control topic behavior.
Key configurations include:
- replication.factor: how many copies of the data exist
- partitions: how many parts the topic is split into
- retention.ms: how long messages are kept before deletion
- cleanup.policy: how old data is removed ('delete' or 'compact')
These control data safety, availability, and storage.
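These settings can be passed at creation time with repeated `--config` flags. A sketch, assuming a broker at `localhost:9092`; the topic name `clickstream` is illustrative.

```shell
# Create a topic with explicit retention and cleanup settings
# (assumes a broker at localhost:9092; names are illustrative).
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic clickstream \
  --partitions 6 \
  --replication-factor 3 \
  --config retention.ms=604800000 \
  --config cleanup.policy=delete
# retention.ms=604800000 keeps messages for 7 days
```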
Result
You can identify and explain the main topic settings that affect data durability and retention.
Understanding these basics is essential before changing or tuning topics in real systems.
3
Intermediate: How Retention and Cleanup Work
🤔 Before reading on: do you think Kafka deletes messages based on time, size, or both? Commit to your answer.
Concept: Explore how Kafka decides when to remove old messages from a topic.
Kafka uses retention policies to delete messages:
- retention.ms: delete messages older than this time
- retention.bytes: delete the oldest log segments once a partition exceeds this size
The cleanup policy can be 'delete' (remove old messages) or 'compact' (keep only the latest value per key).
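Retention settings can also be changed on an existing topic without recreating it, using `kafka-configs.sh`. A sketch assuming a broker at `localhost:9092` and an existing topic named `clickstream`.

```shell
# Adjust retention on an existing topic
# (assumes a broker at localhost:9092; topic name is illustrative).
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name clickstream \
  --alter --add-config retention.ms=86400000,retention.bytes=1073741824
# Keep at most 1 day of data and at most 1 GiB per partition;
# whichever limit is hit first triggers segment deletion.
```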
Result
You understand how Kafka manages disk space and message lifecycle automatically.
Knowing retention and cleanup prevents data loss surprises and helps optimize storage.
4
Intermediate: Replication Factor and Fault Tolerance
🤔 Before reading on: does increasing replication factor improve performance, reliability, or both? Commit to your answer.
Concept: Learn how replication copies topic data to multiple brokers for safety.
Replication factor sets how many copies of each partition exist on different brokers. More copies mean better fault tolerance because if one broker fails, others have the data. However, more replicas use more storage and network resources.
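A common durable setup pairs a replication factor of 3 with the `min.insync.replicas` topic config, which requires a minimum number of in-sync replicas before a write with `acks=all` is acknowledged. A sketch with illustrative names, assuming a broker at `localhost:9092`.

```shell
# Durable topic: 3 replicas, and at least 2 must be in sync
# before an acks=all write is confirmed (names are illustrative).
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic payments \
  --partitions 6 \
  --replication-factor 3 \
  --config min.insync.replicas=2
```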
Result
You can balance replication settings to meet reliability needs without wasting resources.
Understanding replication helps design Kafka clusters that stay up during failures.
5
Intermediate: Partition Count and Parallelism
🤔
Concept: Discover how partitions affect throughput and consumer scaling.
Partitions split a topic into multiple logs. More partitions allow more consumers to read in parallel, increasing throughput. But too many partitions can increase overhead and complexity.
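Partition counts can be increased (but never decreased) on an existing topic. A sketch assuming a broker at `localhost:9092` and an existing topic `clickstream`.

```shell
# Increase partition count on an existing topic.
# Caution: this changes key-to-partition mapping for new messages,
# which matters if consumers rely on per-key ordering.
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic clickstream --partitions 12
```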
Result
You know how to choose partition counts to balance speed and manageability.
Recognizing partition impact helps optimize Kafka for your workload size.
6
Advanced: Log Compaction
🤔 Before reading on: do you think log compaction deletes all old messages or keeps some? Commit to your answer.
Concept: Understand how log compaction keeps only the latest message per key.
Log compaction is a cleanup policy that keeps the latest value for each message key, deleting older duplicates. This is useful for changelog topics where only the current state matters, not full history.
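A compacted changelog topic is created by setting `cleanup.policy=compact`. A sketch with illustrative names, assuming a broker at `localhost:9092`; `min.cleanable.dirty.ratio` controls how aggressively the log cleaner runs.

```shell
# Create a compacted topic for changelog-style state
# (assumes a broker at localhost:9092; names are illustrative).
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic user-profiles \
  --partitions 3 \
  --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.5
```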
Result
You can configure topics to efficiently store state changes without losing important data.
Knowing log compaction enables building efficient stateful stream processing.
7
Expert: Tuning Topic Configurations for Production
🤔 Before reading on: do you think increasing retention time always improves reliability? Commit to your answer.
Concept: Learn how to balance topic settings for performance, cost, and reliability in real systems.
In production, tuning involves:
- Setting retention to balance data availability and storage cost
- Choosing a replication factor for fault tolerance without excess overhead
- Adjusting partition counts for throughput and consumer scaling
- Using log compaction for state topics
- Monitoring and adjusting configs dynamically based on usage patterns
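Dynamic adjustment is done by inspecting and altering topic overrides at runtime, with no broker restart. A sketch assuming a broker at `localhost:9092` and an existing topic `clickstream`.

```shell
# Inspect current per-topic overrides, then tune retention live
# (assumes a broker at localhost:9092; topic name is illustrative).
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name clickstream --describe

kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name clickstream \
  --alter --add-config retention.ms=259200000
# 259200000 ms = 3 days
```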
Result
You can design and maintain Kafka topics that meet business needs efficiently.
Understanding tradeoffs in configuration prevents costly mistakes and downtime.
Under the Hood
Kafka stores topic data as ordered logs split into partitions. Each partition is replicated across brokers based on replication factor. The leader broker handles writes and reads, while followers replicate data. Retention policies run background cleanup threads that delete or compact old log segments based on configured rules. This design ensures high throughput, fault tolerance, and efficient storage.
Why designed this way?
Kafka was designed for high-scale, distributed messaging with durability and speed. Partitioning allows parallelism, replication ensures fault tolerance, and configurable retention balances storage and data availability. Alternatives like traditional message queues lacked this scale or durability. The log-based design simplifies recovery and replay.
┌───────────────┐
│ Kafka Cluster │
├───────────────┤
│ Broker 1      │
│ ┌───────────┐ │
│ │Partition 1│◄───────── Leader
│ └───────────┘ │
│               │
│ Broker 2      │
│ ┌───────────┐ │
│ │Partition 1│◄───────── Follower
│ └───────────┘ │
│               │
│ Broker 3      │
│ ┌───────────┐ │
│ │Partition 1│◄───────── Follower
│ └───────────┘ │
└───────────────┘

Retention & Cleanup Thread → Deletes or compacts old log segments
Myth Busters - 4 Common Misconceptions
Quick: Does increasing retention.ms always prevent data loss? Commit yes or no.
Common Belief: If I set retention.ms very high, my data will never be lost.
Reality: retention.ms controls how long Kafka keeps data, but data can still be lost if a broker fails before replication completes or if a retention.bytes limit is reached first.
Why it matters: Relying solely on retention.ms can cause unexpected data loss during failures or storage pressure.
Quick: Does increasing partitions always improve performance? Commit yes or no.
Common Belief: More partitions always mean better throughput and faster processing.
Reality: While more partitions allow parallelism, too many partitions increase overhead, cause longer recovery times, and can degrade performance.
Why it matters: Over-partitioning can harm cluster stability and increase operational complexity.
Quick: Does replication factor improve message delivery speed? Commit yes or no.
Common Belief: A higher replication factor makes message delivery faster because more copies exist.
Reality: Replication improves fault tolerance but can add latency because data must be copied to multiple brokers before a write is acknowledged.
Why it matters: Misunderstanding this can lead to wrong performance expectations and misconfigured clusters.
Quick: Does log compaction delete all old messages? Commit yes or no.
Common Belief: Log compaction deletes all old messages to save space.
Reality: Log compaction keeps the latest message per key and deletes only older values for that key, preserving the current state.
Why it matters: Confusing this can cause data loss or misuse of topics meant for stateful processing.
Expert Zone
1
Replication factor impacts not just fault tolerance but also leader election speed and cluster recovery time.
2
Retention policies interact with disk usage and consumer lag; tuning one without the other can cause unexpected data loss or backlog.
3
Partition count affects not only throughput but also metadata size and controller load, influencing cluster stability.
When NOT to use
Avoid using very high partition counts for small workloads; instead, use fewer partitions and scale consumers. For topics requiring full message history, do not use log compaction. When low latency is critical, balance replication factor carefully to avoid added write delays.
Production Patterns
In production, teams often use separate topics for raw data with long retention and compacted topics for state. They monitor topic sizes and adjust retention dynamically. Replication factors of 3 are common for fault tolerance. Partition counts are chosen based on expected consumer parallelism and throughput needs.
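When monitoring topics, a quick health check is to describe the topic and confirm that every partition has the expected leader and a full in-sync replica set. A sketch assuming a broker at `localhost:9092`; the topic name is illustrative.

```shell
# Show partition leadership, replica placement, and ISR for a topic
# (assumes a broker at localhost:9092; topic name is illustrative).
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic payments
```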
Connections
Distributed Systems
Topic configuration builds on distributed system principles like replication and partitioning.
Understanding distributed consensus and fault tolerance helps grasp why Kafka replicates partitions and how it manages failures.
Database Indexing
Log compaction in Kafka is similar to database indexing and cleanup strategies.
Knowing how databases keep only relevant data versions clarifies why Kafka compacts logs to keep latest states.
Supply Chain Management
Kafka topic configuration parallels inventory management rules in supply chains.
Just like warehouses decide how much stock to keep and when to discard old items, Kafka topics configure how long and how much data to retain.
Common Pitfalls
#1 Setting retention.ms too low causes data to be deleted before consumers read it.
Wrong approach: kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my-topic --partitions 3 --replication-factor 2 --config retention.ms=60000
Correct approach: kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my-topic --partitions 3 --replication-factor 2 --config retention.ms=604800000
Root cause: Misunderstanding retention.ms units (milliseconds): 60000 ms is only one minute, while 604800000 ms is seven days.
#2 Creating too many partitions for a small workload increases overhead and slows the cluster.
Wrong approach: kafka-topics.sh --bootstrap-server localhost:9092 --create --topic small-topic --partitions 100 --replication-factor 1
Correct approach: kafka-topics.sh --bootstrap-server localhost:9092 --create --topic small-topic --partitions 4 --replication-factor 1
Root cause: Assuming more partitions always improve performance without considering cluster resource limits.
#3 Using a replication factor of 1 in production risks data loss on broker failure.
Wrong approach: kafka-topics.sh --bootstrap-server localhost:9092 --create --topic critical-topic --partitions 6 --replication-factor 1
Correct approach: kafka-topics.sh --bootstrap-server localhost:9092 --create --topic critical-topic --partitions 6 --replication-factor 3
Root cause: Underestimating the importance of replication for fault tolerance.
Key Takeaways
Kafka topic configuration controls how messages are stored, replicated, and cleaned up to balance reliability and performance.
Key settings like partitions, replication factor, and retention policies directly affect data durability and throughput.
Log compaction is a special cleanup mode that keeps only the latest message per key, useful for stateful data.
Proper tuning of topic configurations is essential in production to avoid data loss, performance issues, and storage waste.
Understanding the internal mechanisms of Kafka topics helps make informed decisions about configuration tradeoffs.