
User-defined functions (UDFs) in Hadoop

Introduction

User-defined functions (UDFs) let you plug your own custom logic into Hadoop's data processing. In Hive, Hadoop's SQL layer, a UDF is a Java class that you call from queries to handle tasks the built-in functions cannot do easily. Typical situations include:

When you want to clean or change data in a way Hadoop does not support by default.
When you need to calculate new values from existing data columns.
When you want to filter data using your own rules.
When you want to reuse a custom operation many times in your data processing.
When built-in functions are too slow or not flexible enough for your task.
Syntax
Java
import org.apache.hadoop.hive.ql.exec.UDF;

public class MyCustomUDF extends UDF {
    public String evaluate(String input) {
        if (input == null) return null;
        // Your custom logic here
        return input.toUpperCase();
    }
}

The class must extend the UDF class from Hive (org.apache.hadoop.hive.ql.exec.UDF).

The evaluate method is where you write your logic. It can take different input and return types, and you can overload it with several signatures in the same class; Hive picks the one matching the column types in the query.
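As a sketch of that flexibility, one class can provide several evaluate signatures. The class and method names here are illustrative, and the Hive base class is left as a comment so the snippet compiles without the Hive jars on the classpath; in a real UDF it would be `public class LengthUDF extends UDF` in its own LengthUDF.java file.

```java
// Illustrative sketch: a single UDF class can overload evaluate()
// with different parameter counts and types; Hive resolves the
// overload that matches the column types used in the query.
class LengthUDF /* extends org.apache.hadoop.hive.ql.exec.UDF */ {
    // One String column -> its length
    public Integer evaluate(String input) {
        if (input == null) return null;
        return input.length();
    }

    // Two String columns -> their combined length
    public Integer evaluate(String a, String b) {
        if (a == null || b == null) return null;
        return a.length() + b.length();
    }
}
```

Both overloads follow the same null-in, null-out convention as the examples below.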

Examples
This UDF converts a string to uppercase. If the input is null, it returns null.
Java
import org.apache.hadoop.hive.ql.exec.UDF;

public class ToUpperCaseUDF extends UDF {
    public String evaluate(String input) {
        if (input == null) return null;
        return input.toUpperCase();
    }
}
This UDF adds a prefix to the input string. It handles null inputs safely.
Java
import org.apache.hadoop.hive.ql.exec.UDF;

public class AddPrefixUDF extends UDF {
    public String evaluate(String input) {
        if (input == null) return null;
        return "prefix_" + input;
    }
}
This UDF divides two numbers but returns null if the denominator is zero or null to avoid errors.
Java
import org.apache.hadoop.hive.ql.exec.UDF;

public class SafeDivideUDF extends UDF {
    public Double evaluate(Double numerator, Double denominator) {
        if (denominator == null || denominator == 0) return null;
        return numerator / denominator;
    }
}
This UDF shows how to handle null input by returning a default string.
Java
import org.apache.hadoop.hive.ql.exec.UDF;

public class NullInputUDF extends UDF {
    public String evaluate(String input) {
        if (input == null) return "empty";
        return input;
    }
}
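The same pattern also covers the filtering use case from the introduction: an evaluate method that returns Boolean can be called in a WHERE clause. This is an illustrative sketch with a made-up rule (keep only strings of a minimum length); the Hive base class is left as a comment so the snippet compiles without the Hive jars.

```java
// Illustrative filter UDF: returns true for rows to keep.
// Hypothetical usage in Hive, once registered as is_long_enough:
//   SELECT * FROM users WHERE is_long_enough(name, 3);
class IsLongEnoughUDF /* extends org.apache.hadoop.hive.ql.exec.UDF */ {
    public Boolean evaluate(String input, Integer minLength) {
        // Treat a null value or null threshold as "do not keep".
        if (input == null || minLength == null) return false;
        return input.length() >= minLength;
    }
}
```

Returning false for null rows is a design choice; returning null instead would also drop the rows in a WHERE clause, since Hive treats null as not-true.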
Sample Program

This program defines a UDF that reverses a string. It then tests the UDF with a normal string and a null input, printing the results.

Java
import org.apache.hadoop.hive.ql.exec.UDF;

public class ReverseStringUDF extends UDF {
    public String evaluate(String input) {
        if (input == null) return null;
        return new StringBuilder(input).reverse().toString();
    }
}

// Sample usage in Hive (after packaging the class into a jar):
// ADD JAR /path/to/your-udf.jar;
// CREATE TEMPORARY FUNCTION reverse_string AS 'ReverseStringUDF';
// SELECT reverse_string(name) FROM users;

class TestReverseStringUDF {
    public static void main(String[] args) {
        ReverseStringUDF reverseStringUDF = new ReverseStringUDF();

        String original = "hadoop";
        System.out.println("Original: " + original);

        String reversed = reverseStringUDF.evaluate(original);
        System.out.println("Reversed: " + reversed);

        String nullInput = null;
        System.out.println("Null input reversed: " + reverseStringUDF.evaluate(nullInput));
    }
}
Output

Original: hadoop
Reversed: poodah
Null input reversed: null
Important Notes

Time complexity is usually O(n) where n is the input size, depending on your logic.

Space complexity depends on what you store; simple UDFs use little extra space.

Common mistake: forgetting to handle null inputs. A null row then throws a NullPointerException inside evaluate, which can fail the whole query.
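A minimal sketch of that pitfall (class names are illustrative, and the Hive base class is omitted so the snippet compiles on its own): the unsafe version throws a NullPointerException on a null row, while the guarded version passes null through.

```java
// Unsafe: calling a method on a null input throws NullPointerException.
class UnsafeTrimUDF {
    public String evaluate(String input) {
        return input.trim(); // NPE when input is null
    }
}

// Safe: check for null first and return null, matching the
// null-in, null-out convention used in the examples above.
class SafeTrimUDF {
    public String evaluate(String input) {
        if (input == null) return null;
        return input.trim();
    }
}
```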

Use UDFs when built-in functions do not meet your needs or for reusable custom logic.

Summary

User-defined functions let you add your own data processing steps in Hadoop.

They must extend the UDF class and implement an evaluate method.

Always handle null inputs and test your UDF with different cases.