User-defined functions (UDFs) in Hadoop - Time & Space Complexity
When using user-defined functions (UDFs) in Hadoop, it's important to know how they affect the time it takes to process data.
Specifically, we want to understand how the work a UDF does grows as its input gets bigger.
Analyze the time complexity of the following Hadoop UDF example.
```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class MyUDF extends UDF {
    // Sums the comma-separated integers in the input string.
    public IntWritable evaluate(Text input) {
        int sum = 0;
        String[] parts = input.toString().split(",");
        for (String part : parts) {
            sum += Integer.parseInt(part);
        }
        return new IntWritable(sum);
    }
}
```
This UDF takes a comma-separated string, splits it, converts each part to a number, and sums them up.
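To see the core logic in isolation, here is a minimal plain-Java sketch of the same split-parse-sum behavior, without the Hadoop/Hive types. The class and method names (`SumSketch`, `sum`) are illustrative, not part of the original UDF.

```java
// Plain-Java sketch of the UDF's summing logic: one pass over the parts.
public class SumSketch {
    // Split on commas and sum the parsed integers.
    public static int sum(String input) {
        int total = 0;
        for (String part : input.split(",")) {
            total += Integer.parseInt(part.trim());
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum("1,2,3")); // prints 6
    }
}
```

Note that the loop body runs once per comma-separated part, which is exactly the repeated work we analyze below.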
Look at what repeats as the input grows.
- Primary operation: Loop over each element after splitting the string.
- How many times: Once for each part in the input string (depends on input size).
As the number of parts in the input string grows, the UDF does more work.
| Input Size (n parts) | Approx. Operations |
|---|---|
| 10 | ~10 parse-and-add steps |
| 100 | ~100 parse-and-add steps |
| 1000 | ~1000 parse-and-add steps |
Pattern observation: The work grows directly with the number of parts; doubling input doubles work.
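The pattern in the table can be checked directly with a hypothetical counting harness (the names `ScalingSketch`, `operations`, and `makeInput` are illustrative): count the loop iterations for inputs of different sizes and observe that the count equals the number of parts.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Counts loop iterations to illustrate the linear growth pattern.
public class ScalingSketch {
    // Returns the number of parts processed, i.e. the loop iteration count.
    public static int operations(String input) {
        int count = 0;
        for (String part : input.split(",")) {
            Integer.parseInt(part); // the per-part work
            count++;
        }
        return count;
    }

    // Builds an input string like "1,1,...,1" with n parts.
    public static String makeInput(int n) {
        return IntStream.range(0, n)
                .mapToObj(i -> "1")
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        for (int n : new int[]{10, 100, 1000}) {
            System.out.println(n + " parts -> " + operations(makeInput(n)) + " operations");
        }
    }
}
```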
Time Complexity: O(n)
This means the time to run the UDF grows linearly with the input size.
[X] Wrong: "The UDF runs in constant time no matter the input size."
[OK] Correct: The UDF loops over each part of the input, so more parts mean more work and longer time.
Understanding how UDFs scale helps you write efficient data processing code and explain your choices clearly in interviews.
"What if the UDF used nested loops over the input parts? How would the time complexity change?"