In Hadoop MapReduce, after the Map phase processes input data, the Reduce phase takes over. What is the main role of the Reduce phase?
Think about what happens after the Map phase outputs key-value pairs.
The Reduce phase receives all intermediate values associated with the same key, which the shuffle-and-sort step has already grouped together, and combines them to produce a summarized output. This aggregation is the main purpose of the Reduce phase.
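As a rough illustration of that aggregation, the sketch below mimics the grouping step and a summing reducer in plain Python. It is not Hadoop API code: the names `shuffle` and `reduce_sum` and the sample pairs are assumptions invented for this example.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group intermediate (key, value) pairs by key, as happens before Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_sum(key, values):
    """Combine all values for one key into a single summarized result."""
    return key, sum(values)


# Hypothetical Map output, invented for illustration.
map_output = [("a", 1), ("b", 1), ("a", 1)]

grouped = shuffle(map_output)
results = [reduce_sum(key, values) for key, values in grouped.items()]
print(results)  # [('a', 2), ('b', 1)]
```

Returning the result instead of printing it is a design choice made here so the reducer's output can be inspected or passed on.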
Consider the following simplified Reduce function in Hadoop MapReduce that sums values for each key:
def reduce(key, values):
    total = 0
    for v in values:
        total += v
    print(f"{key}: {total}")

reduce('apple', [2, 3, 5])

What will be the output if the input to reduce is key = 'apple' and values = [2, 3, 5]?
Sum all numbers in the list.
The Reduce function sums the values in the list [2, 3, 5] (2 + 3 + 5 = 10) and prints the total with its key, so the output is apple: 10.
Given the following intermediate key-value pairs from the Map phase:
{'cat': [1, 1, 1], 'dog': [1, 1], 'bird': [1]}

If the Reduce phase sums the values for each key, what is the resulting output?
Sum the list of values for each key.
Each key's values are summed independently: cat (1+1+1=3), dog (1+1=2), bird (1). The result is cat: 3, dog: 2, bird: 1.
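The per-key sums can be checked with a short Python sketch. The variable names `intermediate` and `reduced` are chosen here for illustration, and a dict comprehension stands in for the Reduce phase.

```python
# Intermediate pairs exactly as given in the question.
intermediate = {'cat': [1, 1, 1], 'dog': [1, 1], 'bird': [1]}

# Apply a summing reducer to every key's list of values.
reduced = {key: sum(values) for key, values in intermediate.items()}
print(reduced)  # {'cat': 3, 'dog': 2, 'bird': 1}
```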
Look at this Reduce function code snippet:
def reduce(key, values):
    total = 0
    for v in values
        total += v
    print(f"{key}: {total}")

What error will this code produce when run?
Check the for loop syntax carefully.
The for loop is missing a colon at the end of the line, causing a SyntaxError.
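One way to confirm this without crashing a script is to hand the snippet to Python's built-in `compile()` as a string and catch the exception. The sketch below assumes the snippet is indented as a normal function body. The names `source`, `error_name`, and `error_line` are illustrative, and the exact error message varies between Python versions, so only the exception type is shown.

```python
# Source text of the snippet, kept as a string so the error can be caught.
source = '''
def reduce(key, values):
    total = 0
    for v in values
        total += v
    print(f"{key}: {total}")
'''

error_name = None
error_line = None
try:
    # Parsing fails before any of the snippet's code can run.
    compile(source, "<reduce_snippet>", "exec")
except SyntaxError as exc:
    error_name = type(exc).__name__
    error_line = exc.lineno  # line within `source` where parsing failed

print(error_name)  # SyntaxError
```

Because the error is raised while parsing, no part of the function executes, not even the lines before the faulty `for` statement.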
In a word count MapReduce job, the Map phase outputs key-value pairs where the key is a word and the value is 1 for each occurrence. The Reduce phase sums these counts. Given the following Map output for the word 'data':
[('data', 1), ('data', 1), ('data', 1), ('data', 1)]

Which of the following is the correct Reduce phase output for the key 'data'?
Sum all the counts for the word.
The Reduce phase sums all the 1s for 'data', resulting in 4.
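For a list of pairs like this one, a sort-then-group approach loosely mirrors how keys reach the reducer in order. The following is a minimal sketch using `itertools.groupby` from the standard library, not Hadoop code; the name `counts` is an assumption for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Map output for the word 'data', as given in the question.
map_output = [('data', 1), ('data', 1), ('data', 1), ('data', 1)]

# Sort by key so equal keys are adjacent, then sum each group's counts.
sorted_pairs = sorted(map_output, key=itemgetter(0))
counts = {
    word: sum(count for _, count in group)
    for word, group in groupby(sorted_pairs, key=itemgetter(0))
}
print(counts)  # {'data': 4}
```

The sort matters because `groupby` only merges adjacent items with equal keys.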