MapReduce processes large data by splitting it into parts. What is the main reason for splitting data into chunks?
Think about how splitting helps speed up work by sharing it.
MapReduce splits data so many machines can work on different parts simultaneously, speeding up processing.
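The splitting idea can be sketched in a few lines. This is a minimal illustration, not a real MapReduce framework: the `chunk` helper is a hypothetical name, and the list of integers stands in for a large dataset.

```python
# Minimal sketch: split a dataset into fixed-size chunks so each
# chunk could be handed to a different worker machine.
def chunk(data, size):
    """Yield successive chunks of `size` items from `data`."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

records = list(range(10))          # stand-in for a large dataset
chunks = list(chunk(records, 4))   # → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each chunk is independent of the others, which is exactly what lets different machines process them at the same time.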
MapReduce has two main steps: map and reduce. Why is this two-step process important for parallel data processing?
Consider how independent work and combining results help in parallel tasks.
The map step works on data chunks independently in parallel, and the reduce step gathers and combines these results to produce the final output.
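The two-step shape can be shown with a toy job. The sum-of-squares task and the chunk values below are made up for illustration; the point is that each map call touches only its own chunk, so the map step is parallelizable, and only the small partial results need to be combined.

```python
from functools import reduce

chunks = [[1, 2, 3], [4, 5], [6]]

# Map step: process each chunk independently (each of these could
# run on a separate machine at the same time).
partials = [sum(x * x for x in c) for c in chunks]  # [14, 41, 36]

# Reduce step: combine the independent partial results into one answer.
total = reduce(lambda a, b: a + b, partials)
print(total)  # 91
```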
Given a MapReduce job that counts words in the text: 'cat dog cat bird', what is the output after the reduce step?
Input text: 'cat dog cat bird'. Map step output (key-value pairs): [('cat',1), ('dog',1), ('cat',1), ('bird',1)]. The reduce step sums the counts for each word.
Count how many times each word appears in the input.
The word 'cat' appears twice, 'dog' once, and 'bird' once, so after the reduce step sums the counts, the output is [('cat',2), ('dog',1), ('bird',1)].
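The worked example above can be run end to end. This is a single-process sketch of the word-count job, with the grouping-by-key (the "shuffle") folded into the reduce loop:

```python
from collections import defaultdict

text = "cat dog cat bird"

# Map step: emit one (word, 1) pair per word.
mapped = [(word, 1) for word in text.split()]
# → [('cat', 1), ('dog', 1), ('cat', 1), ('bird', 1)]

# Shuffle + reduce step: group pairs by key and sum the counts.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'cat': 2, 'dog': 1, 'bird': 1}
```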
A MapReduce job is running slower than expected. What is the most likely cause related to parallel processing?
Think about how workload balance affects parallel speed.
If data chunks are uneven in size, some machines finish early and sit idle waiting for the slowest one, so the whole job runs at the pace of the most overloaded machine. This imbalance is known as data skew.
You have a dataset with many small files. Which data splitting strategy will best improve MapReduce parallel processing efficiency?
Consider the overhead of starting many small tasks versus fewer larger tasks.
Combining small files reduces the overhead of managing many tiny tasks, improving parallel processing efficiency in MapReduce.
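One simple combining strategy is to greedily pack small files into splits near a target size, so each task has enough work to amortize its startup overhead. The function, file names, and sizes below are hypothetical (this is not a real HDFS or `CombineFileInputFormat` API), just a sketch of the idea:

```python
# Greedily pack small files into splits of at most `target` MB each,
# so one task handles many small files instead of one tiny file.
def combine_files(file_sizes, target=128):
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > target:
            splits.append(current)          # close the current split
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

files = [("a", 10), ("b", 40), ("c", 90), ("d", 30), ("e", 60)]
print(combine_files(files, target=128))  # [['a', 'b'], ['c', 'd'], ['e']]
```

Five files become three tasks, each with a reasonable amount of work, instead of five tasks that each spend most of their time on startup overhead.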