Input splits and data locality in Hadoop - Mini Project: Build & Apply

Understanding Input Splits and Data Locality in Hadoop

📖 Scenario: You are working with a large text file stored in the Hadoop Distributed File System (HDFS). You want to understand how Hadoop divides this file into input splits and how data locality improves processing speed.

🎯 Goal: Build a short script that models how a Hadoop MapReduce job divides a file into input splits, then reports the number of splits and why data locality matters for processing them.

📋 What You'll Learn

Create a sample input file path variable

Set a split size configuration variable

Write code to calculate input splits based on file size and split size

Print the number of splits and their data locality information

💡 Why This Matters

🌍 Real World

Hadoop processes huge data files by dividing them into logical chunks called input splits, each handled by its own map task. This distributes the work across many computers.

💼 Career

Understanding input splits and data locality is important for optimizing big data jobs and improving processing speed in data engineering roles.
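To make the idea of input splits concrete, here is a minimal Python sketch that lists the byte range each split would cover. It is a simplified model, not Hadoop's own logic: the helper name plan_splits is hypothetical, and real input formats also account for record boundaries and may size the final split differently.

```python
def plan_splits(file_size_bytes, split_size_bytes):
    """Return (offset, length) pairs that together cover the whole file."""
    splits = []
    offset = 0
    while offset < file_size_bytes:
        # The last split may be shorter than the configured split size.
        length = min(split_size_bytes, file_size_bytes - offset)
        splits.append((offset, length))
        offset += length
    return splits


MB = 1024 * 1024  # bytes per megabyte, using the binary convention

# A 100 MB file with a 32 MB split size: three full splits and one short one.
for index, (offset, length) in enumerate(plan_splits(100 * MB, 32 * MB)):
    print(f"split {index}: offset={offset}, length={length}")
```

Each (offset, length) pair is the portion of the file that one map task would read.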

Create the input file path and file size

Create a variable called input_file_path and set it to "/user/hadoop/input/largefile.txt". Also create a variable called file_size_bytes and set it to 134217728 (128 MB, that is 128 × 1024 × 1024 bytes).

Hadoop

# Create variables for input file path and file size
# Your code here

Hint: Use simple assignment to create the input_file_path and file_size_bytes variables.
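The byte counts in this project come from multiplying megabytes by 1024 × 1024. The short sketch below shows that arithmetic; the helper name mb_to_bytes is hypothetical and exists only for illustration.

```python
def mb_to_bytes(megabytes):
    """Convert megabytes to bytes using the binary convention (1 MB = 1024 * 1024 bytes)."""
    return megabytes * 1024 * 1024


print(mb_to_bytes(128))  # 134217728
print(mb_to_bytes(32))   # 33554432
```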

Set the split size configuration

Create a variable called split_size_bytes and set it to 33554432 (32 MB, that is 32 × 1024 × 1024 bytes).

Hadoop

input_file_path = "/user/hadoop/input/largefile.txt"
file_size_bytes = 134217728
# Set the split size in bytes
# Your code here

Hint: Set split_size_bytes to 32 MB expressed in bytes.
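In this project the split size is simply assigned. In a real job, Hadoop's FileInputFormat derives it from the HDFS block size and the configured minimum and maximum split sizes as max(minSize, min(maxSize, blockSize)). The Python sketch below mirrors that formula; the function name and the sample values are assumptions chosen to match this exercise.

```python
def compute_split_size(block_size, min_size, max_size):
    """Mirror of the formula max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


MB = 1024 * 1024

# Assumed values: a 128 MB block with the maximum split size capped at 32 MB.
print(compute_split_size(128 * MB, 1, 32 * MB))   # 33554432

# With a cap larger than the block, the block size wins.
print(compute_split_size(128 * MB, 1, 512 * MB))  # 134217728
```

Capping the maximum below the block size, as in the first call, is how a job ends up with several splits per block.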

Calculate the number of input splits

Create a variable called num_splits that holds the number of splits the file will be divided into. Divide file_size_bytes by split_size_bytes using integer division, then add 1 if there is a remainder.

Hadoop

input_file_path = "/user/hadoop/input/largefile.txt"
file_size_bytes = 134217728
split_size_bytes = 33554432
# Calculate number of splits
# Your code here

Hint: Use integer division // and modulo % to calculate the number of splits.
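The "integer division plus one for a remainder" rule is a ceiling division. The sketch below wraps it in a hypothetical helper, count_splits, and tries it on a few file sizes. It follows this exercise's simplified rule; Hadoop's own FileInputFormat allows the final split to run slightly over the split size instead of creating a tiny extra one, so real counts can differ for some sizes.

```python
def count_splits(file_size_bytes, split_size_bytes):
    """Ceiling division: full splits, plus one more if any bytes are left over."""
    full_splits, remainder = divmod(file_size_bytes, split_size_bytes)
    return full_splits + (1 if remainder else 0)


MB = 1024 * 1024

print(count_splits(128 * MB, 32 * MB))  # 4: divides evenly
print(count_splits(100 * MB, 32 * MB))  # 4: three full splits plus a 4 MB remainder
print(count_splits(10 * MB, 32 * MB))   # 1: the file is smaller than one split
```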

Print the number of splits and data locality info

Print the text "Number of input splits:" followed by num_splits. Then print "Data locality helps run tasks on nodes where data is stored.".

Hadoop

input_file_path = "/user/hadoop/input/largefile.txt"
file_size_bytes = 134217728
split_size_bytes = 33554432
num_splits = file_size_bytes // split_size_bytes + (1 if file_size_bytes % split_size_bytes != 0 else 0)
# Print number of splits and data locality info
# Your code here

Hint: Use two print statements to show the results.
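The printed sentence states what data locality is. The sketch below illustrates how it might play out when tasks are placed. Everything in it is hypothetical: the node names, the replica placement in split_hosts, and the greedy assign function are illustrative assumptions, not Hadoop's scheduler, which also distinguishes rack-local placement.

```python
# Hypothetical replica placement: split index -> nodes that store that split's data.
split_hosts = {
    0: ["node1", "node2", "node3"],
    1: ["node2", "node3", "node4"],
    2: ["node1", "node3", "node4"],
    3: ["node1", "node2", "node4"],
}


def assign(split_hosts, free_nodes):
    """Greedily give each split to a free node that stores its data when possible.

    Assumes there are at least as many free nodes as splits.
    """
    available = list(free_nodes)
    assignments = {}
    for split, hosts in split_hosts.items():
        local_candidates = [node for node in available if node in hosts]
        node = local_candidates[0] if local_candidates else available[0]
        available.remove(node)
        locality = "node-local" if node in hosts else "remote"
        assignments[split] = (node, locality)
    return assignments


# node5 stores none of the data, so the task it receives must read over the network.
for split, (node, locality) in assign(split_hosts, ["node5", "node1", "node2", "node3"]).items():
    print(f"split {split} -> {node} ({locality})")
```

In this run, three tasks read their data from local disk and one task, placed on node5, has to fetch it remotely, which is the slower case that data locality aims to avoid.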