Apache Spark · data · ~30 mins

Column expressions and functions in Apache Spark - Mini Project: Build & Apply

Column expressions and functions
📖 Scenario: You work as a data analyst for a retail company. You have sales data with product names and their prices. You want to create a new column that shows the price after applying a 10% discount.
🎯 Goal: Create a Spark DataFrame with product names and prices, define a discount rate, apply a column expression to calculate discounted prices, and display the final DataFrame.
📋 What You'll Learn
Create a Spark DataFrame named products_df with columns product and price using the exact data provided.
Create a variable named discount_rate and set it to 0.10.
Use Spark column expressions and functions to add a new column discounted_price to products_df that applies the discount.
Show the resulting DataFrame using show().
💡 Why This Matters
🌍 Real World
Retail companies often need to adjust prices dynamically, such as applying discounts or taxes, and analyze the updated prices.
💼 Career
Data analysts and data engineers use Spark column expressions to transform and analyze large datasets efficiently.
1. Create the initial DataFrame
Create a Spark DataFrame called products_df with the following data: [('Apple', 100), ('Banana', 80), ('Cherry', 120)]. The columns should be named product and price.
Hint: Use spark.createDataFrame() with a list of tuples and specify the column names.

2. Define the discount rate
Create a variable called discount_rate and set it to 0.10 to represent a 10% discount.
Hint: Assign 0.10 to a variable named discount_rate.

3. Add discounted price column
Use Spark column expressions to add a new column called discounted_price to products_df. Calculate it by subtracting the discount from the original price using the discount_rate. Use the withColumn method and the col function from pyspark.sql.functions. withColumn returns a new DataFrame rather than modifying products_df in place, so assign the result back to products_df.
Hint: Use products_df = products_df.withColumn('discounted_price', col('price') * (1 - discount_rate)) to create the new column.

4. Show the final DataFrame
Use the show() method on products_df to display the DataFrame with the new discounted_price column.
Hint: Call products_df.show() to display the table.