What is Hadoop - Complexity Analysis
When working with Hadoop, it is important to understand how processing time changes as the data size grows. In other words, we want to know how the work done by a Hadoop job scales when we add more data.
Analyze the time complexity of the following code snippet.
```java
// Hadoop MapReduce WordCount example (simplified; job setup and error handling omitted)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      // split the line into words and emit each word with count 1
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) context.write(new Text(itr.nextToken()), ONE);
    }
  }
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      // sum the counts for this word and write the total
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      context.write(key, new IntWritable(sum));
    }
  }
}
```
This code counts how many times each word appears in a large set of text data using Hadoop's MapReduce.
Identify the operations that repeat: loops, recursion, and array traversals.
- Primary operation: The Mapper processes each line and splits it into words, then the Reducer sums counts for each word.
- How many times: The Mapper runs once for each line of input, and the Reducer runs once for each unique word.
As the input data grows, the Mapper processes more lines, and the Reducer handles more unique words.
| Input Size (n lines) | Approx. Operations |
|---|---|
| 10 | Processes 10 lines, counts words in those lines |
| 100 | Processes 100 lines, counts words in those lines |
| 1000 | Processes 1000 lines, counts words in those lines |
Pattern observation: The work grows roughly in direct proportion to the number of lines of input.
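This pattern can be checked without a cluster. The sketch below is plain Java, not Hadoop code: the class `LinearScaling` and its `countOps` helper are hypothetical names that simulate the map-and-reduce work with a `HashMap` and count one operation per word processed.

```java
import java.util.HashMap;
import java.util.Map;

public class LinearScaling {
    // Simulate map + reduce for nLines identical 4-word lines and
    // return the number of per-word operations performed.
    static long countOps(int nLines) {
        long ops = 0;
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < nLines; i++) {
            for (String word : "the quick brown fox".split(" ")) {
                counts.merge(word, 1, Integer::sum); // one "emit + sum" per word
                ops++;
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        // 10x more lines => 10x more operations: linear growth
        System.out.println(countOps(10));   // 40
        System.out.println(countOps(100));  // 400
        System.out.println(countOps(1000)); // 4000
    }
}
```

Each tenfold increase in lines produces a tenfold increase in operations, which is exactly the proportional growth observed in the table above.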
Time Complexity: O(n)
This means the time to complete the job grows linearly as the input data size increases.
[X] Wrong: "The Reducer runs once for every line of input, so time grows faster than linearly."
[OK] Correct: The Reducer runs once per unique word, not per line, so the main growth depends on input size and word variety, not just lines.
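The distinction can be made concrete with a small simulation (again plain Java, not Hadoop; the class `ReducerCalls` and its `reduceInvocations` helper are hypothetical): two inputs with the same number of lines trigger very different numbers of reduce calls, because each unique word gets exactly one.

```java
import java.util.HashSet;
import java.util.Set;

public class ReducerCalls {
    // One reduce call happens per unique word, so counting
    // distinct words gives the number of reducer invocations.
    static int reduceInvocations(String[] lines) {
        Set<String> uniqueWords = new HashSet<>();
        for (String line : lines) {
            for (String w : line.split(" ")) uniqueWords.add(w);
        }
        return uniqueWords.size();
    }

    public static void main(String[] args) {
        String[] repeated = {"a a a a", "a a a a", "a a a a"}; // 3 lines, 1 unique word
        String[] varied   = {"a b c d", "e f g h", "i j k l"}; // 3 lines, 12 unique words
        System.out.println(reduceInvocations(repeated)); // 1 reduce call
        System.out.println(reduceInvocations(varied));   // 12 reduce calls
    }
}
```

Same line count, so the map phase does the same work, yet the reduce phase runs 1 time versus 12 times: the reducer's workload tracks word variety, not lines.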
Understanding how Hadoop scales with data size shows you can think about big data jobs clearly, a useful skill for many data roles.
"What if the input data had many repeated words versus many unique words? How would that affect the time complexity?"