UDFs (User Defined Functions) in Apache Spark - Time & Space Complexity
When using UDFs in Apache Spark, it's important to understand how execution time grows as data size increases: specifically, how the cost of applying a UDF changes as the number of rows grows.
Analyze the time complexity of the following code snippet.
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def square(x):
    return x * x

# Register the Python function as a Spark UDF that returns integers
square_udf = udf(square, IntegerType())

# DataFrame with one million rows (column "id": 0 .. 999999)
df = spark.range(0, 1000000)
df_with_square = df.withColumn('square', square_udf(df['id']))
```
This code defines a UDF to square numbers and applies it to each row in a Spark DataFrame.
Identify the loops, recursion, or array traversals that determine how often work repeats.
- Primary operation: Applying the UDF function to each row in the DataFrame.
- How many times: Once per row, so as many times as there are rows (n times).
Each row requires one call to the UDF, so the total work grows directly with the number of rows.
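This per-row pattern can be sketched in plain Python (no Spark needed; a counting wrapper is an illustrative stand-in for Spark's actual execution path, not how Spark itself instruments UDFs):

```python
# Illustrative sketch (plain Python, no Spark): wrapping the squaring
# function with a counter shows exactly one call per row, i.e. n calls
# for n rows.
def square(x):
    return x * x

call_count = 0

def counting_square(x):
    global call_count
    call_count += 1
    return square(x)

rows = range(1000)                              # stand-in for DataFrame rows
squares = [counting_square(r) for r in rows]    # one call per row

print(call_count)   # 1000 -- one UDF-style call per row
```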
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 UDF calls |
| 100 | 100 UDF calls |
| 1000 | 1000 UDF calls |
Pattern observation: Doubling the number of rows doubles the number of UDF calls.
Time Complexity: O(n)
This means the time to run the UDF grows linearly with the number of rows in the DataFrame.
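The doubling pattern from the table can be checked with a tiny sketch (again plain Python, standing in for Spark's per-row application):

```python
# Sketch confirming linear growth: n rows -> n UDF calls, so doubling
# the row count doubles the number of calls (O(n)).
def udf_calls_for(n_rows):
    calls = 0
    for _ in range(n_rows):   # one UDF application per row
        calls += 1
    return calls

small = udf_calls_for(1000)
large = udf_calls_for(2000)
print(small, large)   # 1000 2000 -- twice the rows, twice the calls
```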
[X] Wrong: "UDFs run once and apply to all rows at the same time, so time does not grow with data size."
[OK] Correct: Each row must be processed individually by the UDF, so more rows mean more calls and more time.
Understanding UDF time complexity helps you explain performance impacts in Spark jobs and shows you can reason about scaling data operations.
What if we replaced the UDF with a built-in Spark function? How would the time complexity change?