
Adding and renaming columns in Apache Spark

Introduction

We add new columns to include more information and rename columns to make data clearer and easier to understand.

You want to calculate a new value from existing data and save it as a new column.
You need to change a column name to match a report or standard format.
You want to prepare data before analysis by adding helpful labels or categories.
You want to fix unclear or confusing column names in your dataset.
Syntax
Apache Spark
from pyspark.sql.functions import col

# Adding a new column
new_df = df.withColumn('new_column_name', expression)

# Renaming a column
renamed_df = df.withColumnRenamed('old_name', 'new_name')

withColumn creates a new column, or replaces an existing column when the name you pass is already in use.

withColumnRenamed changes the name of one column at a time. If the old name is not in the DataFrame, the call does nothing and raises no error.

Examples
This adds a new column 'double_value' by multiplying 'value' by 2.
Apache Spark
# Assumes an active SparkSession named spark and the col import from the Syntax section
df = spark.createDataFrame([(1, 10), (2, 20)], ['id', 'value'])
df2 = df.withColumn('double_value', col('value') * 2)
This renames the column 'double_value' to 'value_times_two'.
Apache Spark
df3 = df2.withColumnRenamed('double_value', 'value_times_two')
Sample Program

This program creates a DataFrame with the columns 'id' and 'amount'. It adds a new column 'amount_plus_tax' by multiplying 'amount' by 1.1, which adds 10% tax. Then it renames this new column to 'total_amount'. Finally, it shows the updated data.

Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('AddRenameColumns').getOrCreate()

# Create initial DataFrame
data = [(1, 100), (2, 200), (3, 300)]
df = spark.createDataFrame(data, ['id', 'amount'])

# Add a new column 'amount_plus_tax' (10% tax)
df_with_tax = df.withColumn('amount_plus_tax', col('amount') * 1.1)

# Rename 'amount_plus_tax' to 'total_amount'
df_final = df_with_tax.withColumnRenamed('amount_plus_tax', 'total_amount')

# Show final DataFrame
df_final.show()
Important Notes

Adding columns does not change the original DataFrame; it returns a new one.

Renaming columns is useful before saving or sharing data to avoid confusion.

Summary

Use withColumn to add or update columns.

Use withColumnRenamed to rename columns.

These methods help keep your data clear and ready for analysis.