
Backup and disaster recovery in Hadoop - Mini Project: Build & Apply

Backup and Disaster Recovery in Hadoop
📖 Scenario: You work as a data engineer managing a Hadoop cluster that stores important company data. To protect against data loss from hardware failure or accidental deletion, you need to create a backup of critical files and set up a simple disaster recovery plan.
🎯 Goal: Build a basic Hadoop backup and disaster recovery script that copies important files from the main Hadoop Distributed File System (HDFS) directory to a backup directory. Then, verify the backup by listing the files in the backup location.
📋 What You'll Learn
Create a list of important HDFS file paths to back up
Set a backup directory path variable
Write a loop to copy each important file to the backup directory using Hadoop commands
Print the list of files in the backup directory to confirm backup success
💡 Why This Matters
🌍 Real World
Backing up data in Hadoop is critical to prevent data loss from hardware failures or accidental deletions. This project simulates a simple backup and recovery process.
💼 Career
Data engineers and Hadoop administrators regularly create backup and disaster recovery plans to ensure data safety and business continuity.
1
Create a list of important HDFS files
Create a list called important_files with these exact HDFS file paths: /data/sales.csv, /data/customers.csv, and /data/products.csv.
Hadoop
Need a hint?

Use square brackets to create a list and include the file paths as strings.
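A minimal Python sketch of this step, using the exact list name and HDFS paths given above:

```python
# List of important HDFS file paths to back up, as specified in this step
important_files = ["/data/sales.csv", "/data/customers.csv", "/data/products.csv"]

print(important_files)
```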

2
Set the backup directory path
Create a variable called backup_dir and set it to the string /backup/, which will serve as the HDFS backup directory.
Hadoop
Need a hint?

Assign the string '/backup/' to the variable backup_dir.
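This step is a single assignment. A sketch, keeping the trailing slash so the path works cleanly as a copy target:

```python
# HDFS backup directory path; the trailing slash marks it as a directory target
backup_dir = "/backup/"

print(backup_dir)
```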

3
Copy important files to the backup directory
Write a for loop using the variable file_path to iterate over important_files. Inside the loop, use the Hadoop command hadoop fs -cp to copy each file_path to backup_dir. Use an f-string to build the command string exactly as: hadoop fs -cp {file_path} {backup_dir}. Use os.system() to run the command.
Hadoop
Need a hint?

Use a for loop with file_path and build the copy command with an f-string. Then run it with os.system().
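Putting the loop together, a sketch that assumes the important_files list and backup_dir variable from the earlier steps (redeclared here so the snippet is self-contained). Note that os.system simply hands the string to the shell, so this only copies files if the hadoop CLI is installed and the paths exist:

```python
import os

# Values assumed from the earlier steps
important_files = ["/data/sales.csv", "/data/customers.csv", "/data/products.csv"]
backup_dir = "/backup/"

for file_path in important_files:
    # Build the HDFS copy command with an f-string, exactly as the step describes
    command = f"hadoop fs -cp {file_path} {backup_dir}"
    os.system(command)
```

For production scripts, subprocess.run with an argument list is generally preferred over os.system, since it avoids shell-quoting issues and lets you check the return code.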

4
List files in the backup directory to verify
Use os.system() to run the Hadoop command hadoop fs -ls {backup_dir} to list the files in the backup directory and confirm the backup.
Hadoop
Need a hint?

Use os.system() with the list command to show files in the backup directory.
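A sketch of the verification step, again assuming the backup_dir value from step 2. The hadoop fs -ls output goes straight to the terminal, so a successful run prints one line per backed-up file:

```python
import os

backup_dir = "/backup/"  # assumed from step 2

# List the backup directory to confirm the copied files are present
list_command = f"hadoop fs -ls {backup_dir}"
os.system(list_command)
```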