
UDFs (User Defined Functions) in Apache Spark - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is a UDF in Apache Spark?
A UDF (User Defined Function) is a custom function that you write to perform operations on data in Spark DataFrames when built-in functions are not enough.
beginner
How do you register a Python function as a UDF in Spark?
Use spark.udf.register (which also makes the function callable from Spark SQL) or pyspark.sql.functions.udf to convert a Python function into a Spark UDF that can be used in DataFrame operations.
intermediate
Why should you avoid using UDFs when possible in Spark?
UDFs are usually slower because Spark's Catalyst optimizer treats them as black boxes, and in PySpark each row must be serialized between the JVM and a Python worker. Built-in Spark functions run inside Spark's engine and are better optimized.
intermediate
What data types must you specify when creating a UDF in Spark?
You must specify the return data type of the UDF, like StringType, IntegerType, etc., so Spark knows how to handle the output.
beginner
Example: How to create a UDF that adds 10 to a number in PySpark?
Define a Python function: def add_ten(x): return x + 10. Then create UDF: from pyspark.sql.functions import udf; from pyspark.sql.types import IntegerType; add_ten_udf = udf(add_ten, IntegerType()). Use it in DataFrame with df.withColumn('new_col', add_ten_udf(df['col'])).
What does UDF stand for in Apache Spark?
A. Unified Data Frame
B. User Defined Function
C. Universal Data Format
D. User Data File
Answer: B

Which Spark module is used to create UDFs in PySpark?
A. pyspark.sql.functions
B. pyspark.ml
C. pyspark.streaming
D. pyspark.sql.types
Answer: A

Why might using a UDF slow down your Spark job?
A. Because UDFs run outside Spark's optimized engine
B. Because UDFs use too much memory
C. Because UDFs require an internet connection
D. Because UDFs only work on small datasets
Answer: A

What must you specify when defining a UDF in Spark?
A. The file path
B. The input data type
C. The return data type
D. The Spark version
Answer: C

Which of these is a correct way to apply a UDF to a DataFrame column?
A. df.groupBy(my_udf)
B. df.select(my_udf)
C. df.filter(my_udf)
D. df.withColumn('new_col', my_udf(df['col']))
Answer: D
Explain what a UDF is in Apache Spark and when you might need to use one.
Think about extending Spark's capabilities with your own code.
Describe the steps to create and use a UDF in PySpark with a simple example.
Remember the example of adding 10 to a number.