In Hadoop MapReduce, what is the primary purpose of input splits?
Think about how Hadoop processes large datasets efficiently.
Input splits break the input data into smaller pieces so that each mapper can process a chunk independently, enabling parallelism.
Why is data locality important in Hadoop's processing model?
Consider how moving computation close to data affects performance.
Data locality moves computation to the nodes that already hold the data, avoiding costly network transfers of large datasets and improving speed and efficiency.
Given a 640 MB file stored in HDFS with a block size of 128 MB, how many input splits will Hadoop create by default?
Divide the total file size by the block size.
640 MB / 128 MB = 5 splits, each corresponding to one block.
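The arithmetic generalizes: by default Hadoop creates roughly one split per HDFS block, so the count is the file size divided by the block size, rounded up when the file does not divide evenly. A minimal sketch in plain Java (no Hadoop dependencies; the method name is illustrative, and real FileInputFormat also applies a ~10% "slop" factor that can fold a small final remainder into the previous split):

```java
public class SplitCount {
    // Default split count: one split per block, using ceiling division
    // so a partial final block still gets its own split.
    // Simplification: ignores FileInputFormat's 1.1x slop factor.
    static long countSplits(long fileSizeMb, long blockSizeMb) {
        return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
    }

    public static void main(String[] args) {
        System.out.println(countSplits(640, 128));  // the question's case: 5 splits
        System.out.println(countSplits(1000, 128)); // 7 full blocks + 1 partial: 8 splits
    }
}
```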
Consider this Hadoop MapReduce job snippet using a custom InputFormat that combines small files into larger splits (in practice a concrete subclass such as CombineTextInputFormat, since CombineFileInputFormat itself is abstract):
job.setInputFormatClass(CombineTextInputFormat.class);
If the input directory contains 10 files of 10 MB each and the block size is 128 MB, how many input splits will be created?
CombineFileInputFormat packs small files together into splits, bounded by the maximum split size (here taken to be the 128 MB block size).
The ten files total 100 MB, which fits within the 128 MB limit, so CombineFileInputFormat creates a single split.
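The packing logic can be sketched as a greedy bin-fill: files are added to the current split until adding another would exceed the maximum split size. This is a simplification in plain Java (the real CombineFileInputFormat also groups files by node and rack for locality; the method name is illustrative):

```java
public class CombinePacking {
    // Greedy estimate of how many combined splits a set of small files yields
    // for a given maximum split size. Simplification: ignores node/rack pools.
    static int estimateSplits(long[] fileSizesMb, long maxSplitMb) {
        int splits = 0;
        long current = 0; // size of the split being filled
        for (long size : fileSizesMb) {
            if (current + size > maxSplitMb) { // current split is full; start a new one
                splits++;
                current = 0;
            }
            current += size;
        }
        return current > 0 ? splits + 1 : splits; // count the last partial split
    }

    public static void main(String[] args) {
        long[] files = new long[10];
        java.util.Arrays.fill(files, 10);               // ten 10 MB files
        System.out.println(estimateSplits(files, 128)); // 100 MB < 128 MB: 1 split
    }
}
```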
You have a Hadoop cluster with 5 nodes. A large dataset is stored unevenly: 80% on node 1, and the rest spread across nodes 2-5. You want to maximize data locality for a MapReduce job. Which strategy will best improve data locality?
Think about how data distribution affects task assignment.
Rebalancing the data evenly across the cluster (for example with the HDFS balancer) lets mappers on all five nodes read their blocks locally, improving both data locality and parallelism.
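The effect of skew can be illustrated with a toy model, under the simplifying assumptions that map work is shared evenly across nodes and a task is "local" when its block resides on the node running it (the class, method, and numbers are illustrative, not a Hadoop API):

```java
import java.util.Arrays;

public class LocalityToy {
    // Fraction of map tasks that can run data-local, assuming each node
    // processes an equal share of the total blocks. Tasks beyond a node's
    // locally stored blocks must read remotely.
    static double localFraction(int[] blocksPerNode) {
        int total = Arrays.stream(blocksPerNode).sum();
        int share = total / blocksPerNode.length; // even work share per node
        int local = 0;
        for (int b : blocksPerNode) {
            local += Math.min(b, share); // a node can only be local for blocks it holds
        }
        return (double) local / total;
    }

    public static void main(String[] args) {
        // Skewed: 80% of 100 blocks on node 1, rest spread over nodes 2-5.
        System.out.println(localFraction(new int[]{80, 5, 5, 5, 5}));
        // Balanced: 20 blocks per node.
        System.out.println(localFraction(new int[]{20, 20, 20, 20, 20}));
    }
}
```

In this toy model the skewed layout caps locality well below 100%, while the balanced layout lets every task read locally, matching the reasoning above.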