Hadoop · data · ~30 mins

Input splits and data locality in Hadoop - Mini Project: Build & Apply

Understanding Input Splits and Data Locality in Hadoop
📖 Scenario: You are working with a large text file stored in Hadoop Distributed File System (HDFS). You want to understand how Hadoop divides this file into input splits and how data locality helps improve processing speed.
🎯 Goal: Build a simple simulation of a Hadoop MapReduce job setup that shows how the number of input splits is calculated from the file size and split size, and why data locality matters for processing.
📋 What You'll Learn
Create a sample input file path variable
Set a split size configuration variable
Write code to calculate input splits based on file size and split size
Print the number of splits and their data locality information
💡 Why This Matters
🌍 Real World
Hadoop processes huge data files by dividing them into logical chunks called input splits. Each split is handled by its own map task, which lets the work be distributed across many computers.
💼 Career
Understanding input splits and data locality is important for optimizing big data jobs and improving processing speed in data engineering roles.
Step 1: Create the input file path and file size
Create a variable called input_file_path and set it to "/user/hadoop/input/largefile.txt". Also create a variable called file_size_bytes and set it to 134217728 (which is 128 MB).
Hint: Use simple assignment to create the input_file_path and file_size_bytes variables.
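
One way this step could look in Python is sketched below. It assumes the exercise is a plain script that models the file rather than reading from a real HDFS cluster, so the path is just a string.

```python
# Path of the file in HDFS (a plain string here; nothing is read from a cluster).
input_file_path = "/user/hadoop/input/largefile.txt"

# File size in bytes: 128 * 1024 * 1024 = 134217728 (128 MB).
file_size_bytes = 134217728
```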

Step 2: Set the split size configuration
Create a variable called split_size_bytes and set it to 33554432 (which is 32 MB).
Hint: Set split_size_bytes to 32 MB in bytes.
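
A minimal sketch of this step, again assuming a plain Python script:

```python
# Split size in bytes: 32 * 1024 * 1024 = 33554432 (32 MB).
split_size_bytes = 33554432
```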

Step 3: Calculate the number of input splits
Create a variable called num_splits that calculates how many splits the file will be divided into. Use integer division and add 1 if there is a remainder. Use file_size_bytes and split_size_bytes.
Hint: Use integer division // and modulo % to calculate splits.
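
The calculation might be sketched as follows. The values from steps 1 and 2 are repeated so the snippet runs on its own. This is a simplified model for the exercise; Hadoop's own input formats apply additional rules, so split counts on a real cluster can differ.

```python
file_size_bytes = 134217728   # 128 MB, from step 1
split_size_bytes = 33554432   # 32 MB, from step 2

# Whole splits that fit into the file.
num_splits = file_size_bytes // split_size_bytes

# Any leftover bytes need one extra, smaller split.
if file_size_bytes % split_size_bytes != 0:
    num_splits += 1
```

With these values the file divides evenly (128 MB / 32 MB), so num_splits is 4 and no extra split is added.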

Step 4: Print the number of splits and data locality info
Print the text "Number of input splits:" followed by num_splits. Then print "Data locality helps run tasks on nodes where data is stored.".
Hint: Use two print statements to show the results.
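
Putting the four steps together, a complete script could look like the sketch below. It is a self-contained model of the exercise and does not contact a Hadoop cluster.

```python
# Step 1: input file path and size (128 MB).
input_file_path = "/user/hadoop/input/largefile.txt"
file_size_bytes = 134217728

# Step 2: split size (32 MB).
split_size_bytes = 33554432

# Step 3: integer division, plus one split for any remainder.
num_splits = file_size_bytes // split_size_bytes
if file_size_bytes % split_size_bytes != 0:
    num_splits += 1

# Step 4: report the results.
print("Number of input splits:", num_splits)
print("Data locality helps run tasks on nodes where data is stored.")
# Output:
# Number of input splits: 4
# Data locality helps run tasks on nodes where data is stored.
```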