
Disaster recovery planning in Kafka - Commands & Configuration

Introduction
Disaster recovery planning helps you prepare for unexpected failures in your Kafka system. It ensures your data and services can be restored quickly after problems like hardware failure or data corruption.
Use disaster recovery planning when you want to:

  • Protect your Kafka data from accidental deletion or corruption.
  • Recover Kafka topics and messages after a server crash.
  • Replicate Kafka data to another data center for backup.
  • Automate restoring Kafka configurations and topics after a failure.
  • Minimize downtime and data loss during Kafka outages.
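For the cross-data-center scenario, Kafka ships with MirrorMaker 2 (started via connect-mirror-maker.sh). A minimal mm2.properties sketch, assuming two clusters with the hypothetical hostnames primary-kafka and backup-kafka:

```properties
# Cluster aliases (arbitrary labels) and their bootstrap addresses
clusters = primary, backup
primary.bootstrap.servers = primary-kafka:9092
backup.bootstrap.servers = backup-kafka:9092

# Mirror all topics from the primary data center to the backup
primary->backup.enabled = true
primary->backup.topics = .*

# Keep the same fault-tolerance guarantees on mirrored topics
replication.factor = 3
```

Run it with `connect-mirror-maker.sh mm2.properties`; mirrored topics appear on the backup cluster prefixed with the source alias (e.g. `primary.disaster-recovery-test`).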
Config File - server.properties
broker.id=1
log.dirs=/var/lib/kafka-logs
zookeeper.connect=localhost:2181
# Enable topic auto-creation
auto.create.topics.enable=true
# Enable log retention for disaster recovery
log.retention.hours=168
# Enable replication for fault tolerance
default.replication.factor=3
min.insync.replicas=2
# Enable log cleaner for compacted topics
log.cleaner.enable=true
# Configure listeners
listeners=PLAINTEXT://:9092
# Configure inter-broker communication
inter.broker.listener.name=PLAINTEXT

This configuration file sets up a Kafka broker with key settings for disaster recovery:

  • log.retention.hours=168: Keeps logs for 7 days to allow recovery from recent data loss.
  • default.replication.factor=3: Ensures each topic partition is copied to 3 brokers for fault tolerance.
  • min.insync.replicas=2: Requires at least 2 replicas to acknowledge writes, preventing data loss.
  • log.cleaner.enable=true: Enables log compaction for topics where only the latest value per key is kept, useful for recovery.
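Broker-side replication only prevents data loss if producers also request strong acknowledgements. A minimal producer.properties sketch (these are standard Kafka producer settings; values are illustrative):

```properties
bootstrap.servers=localhost:9092
# Wait for all in-sync replicas (at least min.insync.replicas=2) to acknowledge each write
acks=all
# Retry transient failures without introducing duplicate messages
enable.idempotence=true
retries=2147483647
```

With `acks=all` and `min.insync.replicas=2`, a write succeeds only after it exists on at least two brokers, so a single broker failure cannot lose acknowledged data.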
Commands
This command creates a Kafka topic named 'disaster-recovery-test' with 3 partitions and replication factor 3 to ensure data is copied across brokers for recovery.
Terminal
kafka-topics.sh --create --topic disaster-recovery-test --bootstrap-server localhost:9092 --partitions 3 --replication-factor 3
Expected Output
Created topic disaster-recovery-test.
--partitions - Number of partitions for parallelism and fault tolerance
--replication-factor - Number of copies of each partition for data safety
This command starts a producer to send messages to the 'disaster-recovery-test' topic to simulate data input.
Terminal
kafka-console-producer.sh --topic disaster-recovery-test --bootstrap-server localhost:9092
Expected Output
A > prompt appears; each line you type and submit with Enter is sent as a message.
--topic - Specifies the topic to send messages to
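For a repeatable disaster recovery drill, the interactive producer can also be fed non-interactively by piping messages into it. A sketch, assuming a broker on localhost:9092 and kafka-console-producer.sh on the PATH:

```shell
# Send five test messages in one shot instead of typing them at the > prompt
# (assumes a running broker at localhost:9092; adjust for your install).
printf 'message1\nmessage2\nmessage3\nmessage4\nmessage5\n' |
  kafka-console-producer.sh --topic disaster-recovery-test --bootstrap-server localhost:9092
```

This makes the "produce, then verify with a consumer" check easy to script as part of a recovery runbook.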
This command reads the first 5 messages from the 'disaster-recovery-test' topic to verify data is stored and replicated correctly.
Terminal
kafka-console-consumer.sh --topic disaster-recovery-test --bootstrap-server localhost:9092 --from-beginning --max-messages 5
Expected Output
message1
message2
message3
message4
message5
Processed a total of 5 messages
--from-beginning - Reads messages from the start of the topic
--max-messages - Limits the number of messages to read
This command starts a partition reassignment to move partitions to different brokers as part of recovery or maintenance.
Terminal
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file reassignment.json --execute
Expected Output
Successfully started partition reassignments for disaster-recovery-test-0,disaster-recovery-test-1,disaster-recovery-test-2
--execute - Executes the reassignment described in the JSON file
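The reassignment.json file this command reads follows a fixed JSON format: a version number plus a list of partitions with their target replica lists. A sketch for the three partitions created earlier (broker IDs 1-3 are assumptions; use your own broker IDs):

```shell
# Write a reassignment plan placing each partition on brokers 1, 2, and 3,
# rotating which broker leads each partition.
cat > reassignment.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "disaster-recovery-test", "partition": 0, "replicas": [1, 2, 3]},
    {"topic": "disaster-recovery-test", "partition": 1, "replicas": [2, 3, 1]},
    {"topic": "disaster-recovery-test", "partition": 2, "replicas": [3, 1, 2]}
  ]
}
EOF
```

The first broker in each replicas list is the preferred leader, so rotating the order spreads leadership across the cluster.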
This command verifies the status of the partition reassignment to ensure it completed successfully.
Terminal
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file reassignment.json --verify
Expected Output
Reassignment of partition disaster-recovery-test-0 is complete.
Reassignment of partition disaster-recovery-test-1 is complete.
Reassignment of partition disaster-recovery-test-2 is complete.
--verify - Checks the progress and completion of reassignment
Key Concept

If you remember nothing else from disaster recovery planning in Kafka, remember: replication and log retention settings are your first line of defense against data loss.

Common Mistakes
Setting replication factor to 1 for topics
This means there is only one copy of data, so if that broker fails, data is lost.
Always set replication factor to at least 2 or 3 for important topics to ensure copies exist.
Not verifying partition reassignment status
You might think reassignment finished but it could be stuck, causing uneven load or data loss risk.
Always run the verify command after reassignment to confirm success.
Not enabling log retention or setting it too low
Kafka deletes old data too soon, making recovery of recent messages impossible.
Set log retention to a reasonable period (e.g., 7 days) to allow a recovery window.
Summary
Create topics with replication factor 3 to keep multiple copies of data.
Use producers and consumers to test data flow and verify replication.
Use partition reassignment commands to move data safely during recovery.
Verify reassignment completion to ensure cluster health.
Configure log retention and replication in server.properties for disaster readiness.