Apache Spark · Data · ~30 mins

UDFs (User Defined Functions) in Apache Spark - Mini Project: Build & Apply

Using UDFs (User Defined Functions) in Apache Spark
📖 Scenario: You work at a retail company. You have sales data with product names and prices. You want to add a new column that shows the price category: 'Cheap' if the price is less than 20, 'Moderate' if it is between 20 and 50 (inclusive), and 'Expensive' if it is above 50.
🎯 Goal: Create a Spark DataFrame with product data, define a User Defined Function (UDF) to categorize prices, apply it to add a new column, and display the result.
📋 What You'll Learn
Create a Spark DataFrame with exact product and price data
Define a UDF named price_category that categorizes prices
Use the UDF to add a new column category to the DataFrame
Show the final DataFrame with the new column
💡 Why This Matters
🌍 Real World
In real companies, UDFs help add custom logic to big data processing pipelines when built-in functions are not enough.
💼 Career
Knowing how to write and use UDFs is important for data engineers and data scientists working with Apache Spark to transform and analyze large datasets.
1
Create the initial Spark DataFrame
Create a Spark DataFrame called df with these exact rows: ("Pen", 10), ("Notebook", 25), ("Backpack", 60). Use columns named product and price.
Need a hint?

Use spark.createDataFrame with a list of tuples and specify column names.

2
Define the UDF function
Define a Python function called price_category that takes a single argument price and returns the string 'Cheap' if price is less than 20, 'Moderate' if price is between 20 and 50 (inclusive), and 'Expensive' if price is greater than 50.
Need a hint?

Use simple if-elif-else statements to return the correct category string.
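One way to write the function with the boundaries the task specifies (20 and 50 both fall in 'Moderate'):

```python
def price_category(price):
    # Below 20 -> 'Cheap'; 20 through 50 inclusive -> 'Moderate'; above 50 -> 'Expensive'.
    if price < 20:
        return "Cheap"
    elif price <= 50:
        return "Moderate"
    else:
        return "Expensive"
```

Note this is a plain Python function at this point; it only becomes usable in DataFrame expressions after it is registered as a UDF in the next step.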

3
Register the UDF and apply it to the DataFrame
Import udf from pyspark.sql.functions and StringType from pyspark.sql.types. Register the price_category function as a UDF named price_category_udf with return type StringType(). Use withColumn on df to add a new column called category by applying price_category_udf to the price column. Save the result back to df.
Need a hint?

Use udf(function, returnType) to register, then withColumn to add the new column.

4
Show the final DataFrame
Call df.show() to display the final DataFrame with its new category column.
Need a hint?

df.show() prints the table to the console itself and returns None, so there is no need to wrap it in print().