
UDFs (User Defined Functions) in Apache Spark - Step-by-Step Execution

Concept Flow - UDFs (User Defined Functions)
Define UDF function
Register UDF with Spark
Apply UDF to DataFrame column
Spark runs UDF on each row
New column with UDF results
Show or use transformed DataFrame
You first write a plain Python function, then register it with Spark as a UDF. Spark applies it to each row, creating a new column with the results.
Execution Sample
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def greet(name):
    return f"Hello, {name}!"

# Register the function as a UDF, declaring its return type
greet_udf = udf(greet, StringType())

df = spark.createDataFrame([('Alice',), ('Bob',)], ['name'])
df2 = df.withColumn('greeting', greet_udf(df.name))
df2.show()
This code defines a UDF that adds a greeting to each name in the DataFrame.
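Running the sample in a PySpark session should print output along these lines (show() right-aligns values; exact widths depend on the data):

```
+-----+-------------+
| name|     greeting|
+-----+-------------+
|Alice|Hello, Alice!|
|  Bob|  Hello, Bob!|
+-----+-------------+
```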
Execution Table
Step | Action | Input | UDF Output | DataFrame State
1 | Define function greet | name string | N/A | No DataFrame change
2 | Register greet as UDF | greet function | UDF object created | No DataFrame change
3 | Create DataFrame | [('Alice',), ('Bob',)] | N/A | DataFrame with column 'name' and 2 rows
4 | Apply UDF to 'name' column | Row 1: 'Alice' | 'Hello, Alice!' | New column 'greeting' added to row 1
5 | Apply UDF to 'name' column | Row 2: 'Bob' | 'Hello, Bob!' | New column 'greeting' added to row 2
6 | Show DataFrame | N/A | N/A | DataFrame shows columns 'name' and 'greeting' with values
💡 All rows processed: the UDF was applied to each, and the DataFrame was transformed with the new column.
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 6
greet | Function not defined | Function defined | Function defined | Function defined
greet_udf | None | UDF object created | UDF object created | UDF object created
df | None | None | DataFrame with 2 rows and 'name' column | DataFrame unchanged
df2 | None | None | None | DataFrame with 'name' and 'greeting' columns
Key Moments - 3 Insights
Why do we need to register the Python function as a UDF before using it in Spark?
Spark runs your code distributed across many machines and needs to know how to apply your Python function to its data. Registering it as a UDF wraps the function so Spark can ship it to executors and run it on each row's value, as shown in execution table steps 2 and 4.
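As a rough mental model (a toy sketch, not Spark's actual implementation), registration wraps your plain function in an object that knows how to apply it to every value in a column:

```python
# Toy stand-in for Spark's udf(): wraps a plain Python function so an
# engine can apply it element-wise across a column of values.
def make_udf(fn):
    def apply_to_column(column_values):
        return [fn(v) for v in column_values]
    return apply_to_column

def greet(name):
    return f"Hello, {name}!"

greet_udf = make_udf(greet)
print(greet_udf(["Alice", "Bob"]))  # ['Hello, Alice!', 'Hello, Bob!']
```

The real udf() wrapper additionally carries the declared return type and the serialized function, so Spark can send it to executor processes.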
Does the UDF change the original DataFrame or create a new one?
The UDF does not change the original DataFrame. Spark DataFrames are immutable, so withColumn returns a new DataFrame with the added column, as seen in execution table step 6 where df2 has the new 'greeting' column while df still has only 'name'.
What happens if the UDF is applied to a null or missing value?
If the input value is null, a Python UDF receives None, and your function must handle it explicitly; otherwise it may return a misleading result or raise an error. Handling None up front avoids surprises during execution.
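For example, the greet function from the sample would happily produce the string 'Hello, None!' for a null name. A null-aware version (still registered with udf(..., StringType()) exactly as before) could guard explicitly:

```python
def greet(name):
    # Column nulls arrive in a Python UDF as None; guard explicitly,
    # otherwise the f-string would return the string "Hello, None!".
    if name is None:
        return None
    return f"Hello, {name}!"

print(greet(None))     # None -> Spark stores a null in the new column
print(greet("Alice"))  # Hello, Alice!
```

Returning None from the function makes Spark write a proper null into the result column instead of a garbage string.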
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the UDF output when the input name is 'Bob'?
A. 'Hello, Bob!'
B. 'Hello, Alice!'
C. 'Bob'
D. None
💡 Hint
Check execution table row 5, where the input is 'Bob', and read the UDF Output column.
At which step does the DataFrame get a new column added?
A. Step 3
B. Step 6
C. Step 4
D. Step 2
💡 Hint
Look at execution table rows 4 and 5, where the UDF is applied and new column values appear.
If we skip registering the function as a UDF, what will happen when applying it to the DataFrame?
A. Spark will apply the function automatically
B. Spark will throw an error because it doesn't recognize the function
C. The DataFrame will remain unchanged
D. The function will run but the output will be null
💡 Hint
Refer to the Key Moments question about why registering as a UDF is necessary.
Concept Snapshot
UDFs let you run your own Python code on Spark DataFrames.
Define a Python function.
Register it as a UDF with a return type.
Apply it to DataFrame columns.
Spark runs it on each row, creating new columns.
Original DataFrame stays unchanged.
Full Transcript
User Defined Functions (UDFs) in Apache Spark allow you to apply your own Python functions to DataFrame columns. First, you define a Python function that does what you want, like adding a greeting to a name. Then, you register this function as a UDF with Spark, telling it the return type. After that, you apply the UDF to a DataFrame column. Spark runs your function on each row's value and creates a new column with the results. The original DataFrame does not change; instead, a new DataFrame with the added column is created. This process lets you customize data processing beyond built-in Spark functions.