Which of the following best describes the primary purpose of Hadoop's DistCp tool in backup and disaster recovery?
Think about tools designed to move data between clusters.
DistCp (Distributed Copy) is designed to copy large datasets efficiently across Hadoop clusters, which is essential for backup and disaster recovery.
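As a sketch, a typical cross-cluster backup with DistCp looks like the following (the NameNode hostnames, ports, and paths here are illustrative, not from the question):

```shell
# Copy /user/data from a production cluster to a backup cluster.
# -update copies only files that differ; -p preserves permissions/timestamps.
hadoop distcp -update -p hdfs://prod-nn:8020/user/data \
    hdfs://backup-nn:8020/backup/data
```

DistCp runs as a MapReduce job, so the copy is parallelized across the cluster, which is what makes it practical for large datasets.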
What will be the output of the following HDFS command sequence?
hdfs dfs -mkdir /data
hdfs dfs -put file1.txt /data/
hdfs dfs -createSnapshot /data snap1
hdfs dfs -rm /data/file1.txt
hdfs dfs -ls /data/.snapshot/snap1
Remember what snapshots do in HDFS.
HDFS snapshots preserve the state of a directory at the moment the snapshot is created (note that snapshots can only be taken on a directory that has first been made snapshottable with hdfs dfsadmin -allowSnapshot). Even though file1.txt is deleted afterwards, it remains visible under /data/.snapshot/snap1.
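This is also how you recover the file: the snapshot directory is read-only, so you copy the preserved file back out. A minimal sketch (paths taken from the question above):

```shell
# Restore the deleted file from the snapshot back into the live directory
hdfs dfs -cp /data/.snapshot/snap1/file1.txt /data/
```

After the copy, /data contains file1.txt again, while snap1 itself is unchanged.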
You run a Hadoop backup job that uses compression. The original data size is 500 GB. After backup, the compressed backup size is 150 GB. What is the compression ratio?
Compression ratio = original size / compressed size.
The compression ratio is the original data size divided by the compressed size: 500 GB / 150 GB ≈ 3.33, i.e., a ratio of about 3.33:1.
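The arithmetic can be checked with a few lines of Python (sizes in GB, taken from the question):

```python
original_gb = 500
compressed_gb = 150

# Compression ratio = original size / compressed size
ratio = original_gb / compressed_gb
print(round(ratio, 2))  # → 3.33
```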
Consider this Hadoop backup script snippet:
hdfs dfs -mkdir /backup
hadoop distcp /user/data /backup/data_backup
hdfs dfs -rm -r /user/data
What is the main risk or error in this script?
Think about safe backup practices.
Deleting the original data immediately after the copy, without verifying that the DistCp job succeeded, risks permanent data loss if the copy fails or is incomplete.
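A safer version of the script gates the delete on DistCp's exit status. This is a sketch using the paths from the question; in practice you would also compare file counts or checksums before removing anything:

```shell
hdfs dfs -mkdir -p /backup
if hadoop distcp /user/data /backup/data_backup; then
    # DistCp exited 0: the copy completed. Verify before deleting if possible.
    hdfs dfs -rm -r /user/data
else
    # Non-zero exit: leave the source data untouched.
    echo "DistCp failed; /user/data not deleted" >&2
    exit 1
fi
```

Since DistCp returns a non-zero exit code when the job fails, the original data is only removed on a successful copy.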
You are tasked with designing a disaster recovery plan for a Hadoop cluster that must minimize downtime and data loss. Which combination of strategies is best?
Consider both data safety and recovery speed.
Combining HDFS snapshots, which give fast local point-in-time recovery from accidental deletes or corruption, with periodic remote replication via DistCp minimizes downtime and also protects against cluster-wide failures such as a full site outage.
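The two strategies compose naturally: take a snapshot for a consistent image, then replicate that snapshot to the remote cluster. A hedged sketch (the snapshot name, schedule, and DR cluster address are illustrative assumptions):

```shell
# One-time setup: make the directory snapshottable
hdfs dfsadmin -allowSnapshot /user/data

# Take a consistent point-in-time snapshot
hdfs dfs -createSnapshot /user/data daily_backup

# Replicate the snapshot image (a stable, read-only view) to a remote cluster
hadoop distcp -update /user/data/.snapshot/daily_backup \
    hdfs://dr-cluster:8020/backup/user_data
```

Copying from the .snapshot path rather than the live directory avoids inconsistencies from files changing mid-copy.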