Apache Spark · data · ~30 mins

Adding and renaming columns in Apache Spark - Mini Project: Build & Apply

Adding and renaming columns
📖 Scenario: You work in a small bakery that keeps track of daily sales data in a Spark DataFrame. You want to add new information and rename columns to make the data easier to understand.
🎯 Goal: Learn how to add a new column to a Spark DataFrame and rename an existing column.
📋 What You'll Learn
Create a Spark DataFrame with specific sales data
Add a new column with calculated values
Rename an existing column
Display the final DataFrame
💡 Why This Matters
🌍 Real World
Adding and renaming columns is common when cleaning and preparing data for reports or analysis in business settings.
💼 Career
Data analysts and data scientists often need to modify DataFrames to add calculated fields and improve column names for better understanding.
1
Create the initial sales DataFrame
Create a Spark DataFrame called sales_df with these exact columns and data: date as strings and items_sold as integers. The rows are: ('2024-04-01', 30), ('2024-04-02', 45), ('2024-04-03', 50).
Need a hint?

Use spark.createDataFrame with a list of tuples and specify the column names.

2
Define the price per item
Create a variable called price_per_item and set it to 2.5. This is the price of one item in dollars.
Need a hint?

Just create a variable named price_per_item and assign it the value 2.5.

3
Add revenue column using price_per_item
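Add a new column called revenue to sales_df. Calculate revenue by multiplying items_sold by price_per_item. Use the withColumn method and the lit function from pyspark.sql.functions. Because withColumn returns a new DataFrame rather than modifying sales_df in place, assign the result back to sales_df.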
Need a hint?

Use withColumn('revenue', sales_df['items_sold'] * lit(price_per_item)) to add the new column.

4
Rename the items_sold column and show the DataFrame
Rename the column items_sold to quantity_sold using the withColumnRenamed method on sales_df, and assign the result back to sales_df. Then print the DataFrame using show().
Need a hint?

Use withColumnRenamed('items_sold', 'quantity_sold') and then show() to display the DataFrame.