
User-defined functions (UDFs) in Hadoop

Introduction

User-defined functions (UDFs) let you plug your own custom logic into Hadoop's data processing. In Hive, Hadoop's SQL layer, a UDF is a Java class that you call from queries to handle tasks the built-in functions cannot do easily. Typical situations include:

When you want to clean or change data in a way Hadoop does not support by default.
When you need to calculate new values from existing data columns.
When you want to filter data using your own rules.
When you want to reuse a custom operation many times in your data processing.
When built-in functions are too slow or not flexible enough for your task.
Syntax
Java
import org.apache.hadoop.hive.ql.exec.UDF;

public class MyCustomUDF extends UDF {
    public String evaluate(String input) {
        if (input == null) return null;
        // Your custom logic here
        return input.toUpperCase();
    }
}

The class must extend the UDF class from Hive (org.apache.hadoop.hive.ql.exec.UDF).

The evaluate method is where you write your logic. It can take different input and return types, and you can overload it with several signatures in the same class; Hive picks the one matching the column types in the query.
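As a sketch of that flexibility, one class can provide several evaluate signatures. The class and method names here are illustrative, and the Hive base class is left as a comment so the snippet compiles without the Hive jars on the classpath; in a real UDF it would be `public class LengthUDF extends UDF` in its own LengthUDF.java file.

```java
// Illustrative sketch: a single UDF class can overload evaluate()
// with different parameter counts and types; Hive resolves the
// overload that matches the column types used in the query.
class LengthUDF /* extends org.apache.hadoop.hive.ql.exec.UDF */ {
    // One String column -> its length
    public Integer evaluate(String input) {
        if (input == null) return null;
        return input.length();
    }

    // Two String columns -> their combined length
    public Integer evaluate(String a, String b) {
        if (a == null || b == null) return null;
        return a.length() + b.length();
    }
}
```

Both overloads follow the same null-in, null-out convention as the examples below.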

Examples
This UDF converts a string to uppercase. If the input is null, it returns null.
Java
import org.apache.hadoop.hive.ql.exec.UDF;

public class ToUpperCaseUDF extends UDF {
    public String evaluate(String input) {
        if (input == null) return null;
        return input.toUpperCase();
    }
}
This UDF adds a prefix to the input string. It handles null inputs safely.
Java
import org.apache.hadoop.hive.ql.exec.UDF;

public class AddPrefixUDF extends UDF {
    public String evaluate(String input) {
        if (input == null) return null;
        return "prefix_" + input;
    }
}
This UDF divides two numbers but returns null if the denominator is zero or null to avoid errors.
Java
import org.apache.hadoop.hive.ql.exec.UDF;

public class SafeDivideUDF extends UDF {
    public Double evaluate(Double numerator, Double denominator) {
        if (denominator == null || denominator == 0) return null;
        return numerator / denominator;
    }
}
This UDF shows how to handle null input by returning a default string.
Java
import org.apache.hadoop.hive.ql.exec.UDF;

public class NullInputUDF extends UDF {
    public String evaluate(String input) {
        if (input == null) return "empty";
        return input;
    }
}
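The same pattern also covers the filtering use case from the introduction: an evaluate method that returns Boolean can be called in a WHERE clause. This is an illustrative sketch with a made-up rule (keep only strings of a minimum length); the Hive base class is left as a comment so the snippet compiles without the Hive jars.

```java
// Illustrative filter UDF: returns true for rows to keep.
// Hypothetical usage in Hive, once registered as is_long_enough:
//   SELECT * FROM users WHERE is_long_enough(name, 3);
class IsLongEnoughUDF /* extends org.apache.hadoop.hive.ql.exec.UDF */ {
    public Boolean evaluate(String input, Integer minLength) {
        // Treat a null value or null threshold as "do not keep".
        if (input == null || minLength == null) return false;
        return input.length() >= minLength;
    }
}
```

Returning false for null rows is a design choice; returning null instead would also drop the rows in a WHERE clause, since Hive treats null as not-true.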
Sample Program

This program defines a UDF that reverses a string. It then tests the UDF with a normal string and a null input, printing the results.

Java
import org.apache.hadoop.hive.ql.exec.UDF;

public class ReverseStringUDF extends UDF {
    public String evaluate(String input) {
        if (input == null) return null;
        return new StringBuilder(input).reverse().toString();
    }
}

// Sample usage in Hive (after packaging the class into a jar):
// ADD JAR /path/to/your-udf.jar;
// CREATE TEMPORARY FUNCTION reverse_string AS 'ReverseStringUDF';
// SELECT reverse_string(name) FROM users;

class TestReverseStringUDF {
    public static void main(String[] args) {
        ReverseStringUDF reverseStringUDF = new ReverseStringUDF();

        String original = "hadoop";
        System.out.println("Original: " + original);

        String reversed = reverseStringUDF.evaluate(original);
        System.out.println("Reversed: " + reversed);

        String nullInput = null;
        System.out.println("Null input reversed: " + reverseStringUDF.evaluate(nullInput));
    }
}
Output

Original: hadoop
Reversed: poodah
Null input reversed: null
Important Notes

Time complexity is usually O(n) where n is the input size, depending on your logic.

Space complexity depends on what you store; simple UDFs use little extra space.

Common mistake: forgetting to handle null inputs. A null row then throws a NullPointerException inside evaluate, which can fail the whole query.
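A minimal sketch of that pitfall (class names are illustrative, and the Hive base class is omitted so the snippet compiles on its own): the unsafe version throws a NullPointerException on a null row, while the guarded version passes null through.

```java
// Unsafe: calling a method on a null input throws NullPointerException.
class UnsafeTrimUDF {
    public String evaluate(String input) {
        return input.trim(); // NPE when input is null
    }
}

// Safe: check for null first and return null, matching the
// null-in, null-out convention used in the examples above.
class SafeTrimUDF {
    public String evaluate(String input) {
        if (input == null) return null;
        return input.trim();
    }
}
```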

Use UDFs when built-in functions do not meet your needs or for reusable custom logic.

Summary

User-defined functions let you add your own data processing steps in Hadoop.

They must extend the UDF class and implement an evaluate method.

Always handle null inputs and test your UDF with different cases.