In the MapReduce job execution flow, which step happens first?
Think about how the data is prepared before processing.
The first step is splitting the input data into chunks so that each map task can process a part of the data in parallel.
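The splitting step can be sketched in plain Python; this is a simulation of input splitting, not the Hadoop API, and `split_input` and `chunk_size` are hypothetical names:

```python
# Hypothetical sketch of input splitting: divide a list of records
# into fixed-size chunks, one chunk per map task.
def split_input(records, chunk_size):
    # Slice the records into successive chunks of at most chunk_size items.
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

chunks = split_input(["apple", "banana", "apple", "cherry"], 2)
print(chunks)  # [['apple', 'banana'], ['apple', 'cherry']]
```

In Hadoop itself, split sizes are typically tied to the HDFS block size rather than a record count, but the principle of one split per map task is the same.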
During the MapReduce job execution, what is the main purpose of the shuffle phase?
Think about how data moves from mappers to reducers.
The shuffle phase sorts and transfers the output of map tasks to the appropriate reduce tasks based on keys.
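A minimal single-process sketch of the shuffle step, grouping map output pairs by key and sorting the keys (a simulation, not Hadoop's implementation; `shuffle` is a hypothetical helper name):

```python
from collections import defaultdict

# Sketch of the shuffle phase: collect all values for each key so that
# every reducer receives the complete list of values for its keys.
def shuffle(map_output):
    grouped = defaultdict(list)
    for key, value in map_output:
        grouped[key].append(value)
    # Sort by key, mirroring the sort step of the shuffle phase.
    return dict(sorted(grouped.items()))

print(shuffle([("apple", 1), ("banana", 1), ("apple", 1)]))
# {'apple': [1, 1], 'banana': [1]}
```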
Given the input data ["apple", "banana", "apple"] and a map function that outputs (word, 1) for each word, what is the map phase output?
input_data = ["apple", "banana", "apple"]
map_output = [(word, 1) for word in input_data]
print(map_output)
The map function emits one pair per input word.
The map phase outputs a list of key-value pairs, one for each input element, without aggregation.
Given the intermediate data {'apple': [1, 1], 'banana': [1]}, what is the output of the reducer that sums the values?
intermediate_data = {'apple': [1, 1], 'banana': [1]}
reducer_output = {k: sum(v) for k, v in intermediate_data.items()}
print(reducer_output)
The reducer sums all values for each key.
The reducer adds the list of values for each key to produce the final count.
In Hadoop's MapReduce architecture, which component is responsible for managing the entire job execution, including resource allocation and task scheduling?
Think about the component that coordinates tasks across the cluster.
The JobTracker manages job execution, resource allocation, and task scheduling in Hadoop MapReduce.
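The full job flow covered above can be combined into one minimal word-count sketch that runs in a single process (a simulation of the map, shuffle, and reduce phases, not the Hadoop API; the function names are hypothetical):

```python
from collections import defaultdict

# Map phase: emit (word, 1) for each input word.
def map_phase(records):
    return [(word, 1) for word in records]

# Shuffle phase: group values by key and sort the keys.
def shuffle_phase(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

# Reduce phase: sum the values for each key.
def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

result = reduce_phase(shuffle_phase(map_phase(["apple", "banana", "apple"])))
print(result)  # {'apple': 2, 'banana': 1}
```

In a real cluster, the JobTracker (or, in YARN-based Hadoop, the ResourceManager and per-job ApplicationMaster) schedules these phases across many machines rather than running them in one process.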