
Input splits and data locality in Hadoop - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual · Intermediate
Understanding Input Splits in Hadoop

In Hadoop MapReduce, what is the primary purpose of input splits?

A. To store the output data after processing
B. To compress data before sending it to reducers
C. To divide the input data into manageable chunks for parallel processing
D. To replicate data across nodes for fault tolerance
💡 Hint

Think about how Hadoop processes large datasets efficiently.

🧠 Conceptual · Intermediate
Data Locality Importance

Why is data locality important in Hadoop's processing model?

A. It ensures data is encrypted during processing
B. It reduces network traffic by processing data on the node where it is stored
C. It balances the load evenly across all nodes regardless of data location
D. It compresses data to save storage space
💡 Hint

Consider how moving computation close to data affects performance.

🧠 Conceptual · Advanced
Input Split Sizes Calculation

Given a 640 MB file stored in HDFS with a block size of 128 MB, how many input splits will Hadoop create by default?

A. 5
B. 4
C. 6
D. 8
💡 Hint

Divide the total file size by the block size.
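As the hint says, the default split count is the file size divided by the split size, rounded up, and the split size defaults to the HDFS block size. (Hadoop's FileInputFormat actually lets the final split run about 10% over the block size, but for an exact multiple like 640 MB the plain ceiling holds.) A minimal sketch of the arithmetic; SplitCount is an illustrative class, not a Hadoop API:

```java
// Illustrative sketch (SplitCount is not a Hadoop class): the default
// number of input splits is ceil(fileSize / splitSize), where the split
// size defaults to the HDFS block size.
public class SplitCount {
    // Ceiling division on sizes expressed in MB.
    static long splitCount(long fileSizeMb, long blockSizeMb) {
        return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
    }

    public static void main(String[] args) {
        // 640 MB file with 128 MB blocks -> 5 splits
        System.out.println(splitCount(640, 128));
    }
}
```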

Predict Output · Advanced
Effect of Custom InputFormat on Splits

Consider this Hadoop MapReduce job snippet using a custom InputFormat that combines small files into larger splits:

job.setInputFormatClass(CombineFileInputFormat.class);

If the input directory contains 10 files of 10 MB each and the block size is 128 MB, how many input splits will be created?

A. 5
B. 10
C. 0
D. 1
💡 Hint

CombineFileInputFormat merges small files into splits up to block size.
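For context, a hedged configuration sketch (it assumes a Job object configured elsewhere; CombineTextInputFormat is Hadoop's concrete text subclass of CombineFileInputFormat). The max split size caps how much data one combined split may hold, so ten 10 MB files (100 MB total) fit under a 128 MB cap:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

// Sketch only: `job` is assumed to be created elsewhere.
// CombineTextInputFormat packs many small files into fewer splits,
// bounded by the configured maximum split size.
job.setInputFormatClass(CombineTextInputFormat.class);

// Cap each combined split at 128 MB; ten 10 MB files (100 MB total)
// then fall into a single split.
CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
```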

🚀 Application · Expert
Optimizing Data Locality in a Hadoop Cluster

You have a Hadoop cluster with 5 nodes. A large dataset is stored unevenly: 80% on node 1, and the rest spread across nodes 2-5. You want to maximize data locality for a MapReduce job. Which strategy will best improve data locality?

A. Distribute the dataset evenly across all nodes before running the job
B. Run all mappers only on node 1 to process the majority of data
C. Increase the number of reducers to balance the load
D. Compress the dataset to reduce its size
💡 Hint

Think about how data distribution affects task assignment.