
UDFs (User Defined Functions) in Apache Spark - Practice Problems & Coding Challenges

Challenge: 5 Problems
🎖️ UDF Mastery Badge: answer all five challenges correctly to earn it.
Problem 1: Predict Output (intermediate)
Output of a simple UDF application
What is the output of this Spark code that uses a UDF to square numbers in a DataFrame?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master('local').appName('UDF Test').getOrCreate()
data = [(1,), (2,), (3,)]
df = spark.createDataFrame(data, ['num'])

square_udf = udf(lambda x: x * x, IntegerType())
df2 = df.withColumn('square', square_udf(df['num']))
df2.show()
A
+---+------+
|num|square|
+---+------+
|  1|     1|
|  2|     4|
|  3|     9|
+---+------+
B
TypeError: 'int' object is not callable
C
SyntaxError: invalid syntax
D
+---+------+
|num|square|
+---+------+
|  1|     2|
|  2|     4|
|  3|     6|
+---+------+
💡 Hint
Think about what the lambda function does to each number.
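If it helps, the lambda a UDF wraps is just an ordinary Python function, so you can probe it directly before reasoning about the DataFrame (a quick sketch, independent of Spark):

```python
# The same kind of lambda the UDF wraps; calling it directly in
# plain Python shows what Spark applies to each row value.
square = lambda x: x * x

print(square(4))  # → 16
print(square(7))  # → 49
```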
Problem 2: Data Output (intermediate)
Result of UDF with string manipulation
Given this Spark code with a UDF that reverses strings, what is the resulting DataFrame output?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master('local').appName('UDF String').getOrCreate()
data = [('apple',), ('banana',), ('cherry',)]
df = spark.createDataFrame(data, ['fruit'])

reverse_udf = udf(lambda s: s[::-1], StringType())
df2 = df.withColumn('reversed', reverse_udf(df['fruit']))
df2.show()
A
+------+--------+
| fruit|reversed|
+------+--------+
| apple|   elppa|
|banana|  ananab|
|cherry|  yrrehc|
+------+--------+
B
+------+--------+
| fruit|reversed|
+------+--------+
| apple|   apple|
|banana|  banana|
|cherry|  cherry|
+------+--------+
C
TypeError: 'NoneType' object is not subscriptable
D
SyntaxError: invalid syntax
💡 Hint
The lambda reverses the string using slicing.
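To see the slicing idiom in isolation, try it on a word that is not in the problem's data (plain Python, no Spark needed):

```python
# s[::-1] is Python's slice-reversal idiom: a step of -1 walks
# the string backwards, producing a reversed copy.
word = "spark"
print(word[::-1])  # → "kraps"
```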
Problem 3: 🔧 Debug (advanced)
Identify the error in this UDF code
What error will this Spark code raise when trying to apply the UDF?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master('local').appName('UDF Error').getOrCreate()
data = [(1,), (2,), (3,)]
df = spark.createDataFrame(data, ['num'])

def add_five(x):
    return x + 5

add_five_udf = udf(add_five)
df2 = df.withColumn('plus_five', add_five_udf(df['num']))
df2.show()
A
SyntaxError: invalid syntax
B
TypeError: add_five() missing 1 required positional argument: 'x'
C
+---+---------+
|num|plus_five|
+---+---------+
|  1|        6|
|  2|        7|
|  3|        8|
+---+---------+
D
AnalysisException: UDF return type must be specified
💡 Hint
Check how the UDF is defined and registered.
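Two things worth recalling while you decide: in PySpark, `udf(f)` with no explicit return type defaults to `StringType()`, and the wrapped Python function still runs as ordinary Python. You can check the plain function in isolation (Spark not required):

```python
# The same function the problem registers as a UDF.
def add_five(x):
    return x + 5

# Called directly, the function is perfectly valid Python.
print(add_five(1))  # → 6

# Note: PySpark's udf() defaults its returnType to StringType(),
# so omitting the type is not itself an error.
```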
Problem 4: 🚀 Application (advanced)
Using UDF to categorize ages
You want to add a new column 'age_group' to a Spark DataFrame with ages, categorizing ages as 'child' (<18), 'adult' (18-64), or 'senior' (65+). Which UDF definition and usage will produce the correct output?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master('local').appName('Age Group').getOrCreate()
data = [(5,), (30,), (70,)]
df = spark.createDataFrame(data, ['age'])
A
age_group_udf = udf(lambda age: 'child' if age < 18 else 'adult' if age < 65 else 'senior', StringType())
df2 = df.withColumn('age_group', age_group_udf(df['age']))
df2.show()
B
age_group_udf = udf(lambda age: 'child' if age < 18 else 'adult' if age <= 64 else 'senior', StringType())
df2 = df.withColumn('age_group', age_group_udf(df['age']))
df2.show()
C
age_group_udf = udf(lambda age: 'child' if age <= 18 else 'adult' if age <= 65 else 'senior', StringType())
df2 = df.withColumn('age_group', age_group_udf(df['age']))
df2.show()
D
age_group_udf = udf(lambda age: 'child' if age <= 18 else 'adult' if age < 65 else 'senior', StringType())
df2 = df.withColumn('age_group', age_group_udf(df['age']))
df2.show()
df2 = df.withColumn('age_group', age_group_udf(df['age']))
df2.show()
💡 Hint
Check the age boundaries carefully for each group.
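A plain-Python way to vet any candidate lambda before wrapping it in a udf is to probe the boundary ages directly (18 and 65 are the edges named in the problem; the candidate below is a deliberately imperfect example, not necessarily one of the four options):

```python
# Expected categories at the boundary ages, straight from the
# problem statement: <18 is 'child', 18-64 is 'adult', 65+ is 'senior'.
expected = {17: 'child', 18: 'adult', 64: 'adult', 65: 'senior'}

def check(candidate):
    # Return the boundary ages where the candidate disagrees with the spec.
    return [age for age in expected if candidate(age) != expected[age]]

# A candidate that uses <= on both edges misclassifies two boundaries:
candidate = lambda age: 'child' if age <= 18 else 'adult' if age <= 65 else 'senior'
print(check(candidate))  # → [18, 65]
```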
Problem 5: 🧠 Conceptual (expert)
Why avoid UDFs when possible in Spark?
Which is the main reason to avoid using UDFs in Spark when native functions exist?
A
UDFs cannot handle null values, causing runtime exceptions.
B
UDFs always cause syntax errors in Spark SQL queries.
C
UDFs are slower because they are opaque to Spark's Catalyst optimizer and add serialization overhead between the JVM and Python, causing performance loss.
D
UDFs automatically convert data to Pandas DataFrames, which is inefficient.
💡 Hint
Think about how Spark optimizes queries and executes code.