Adding and renaming columns in Apache Spark - Step-by-Step Execution

Concept Flow - Adding and renaming columns

Start with DataFrame

↓

Add new column with value or expression

↓

Rename existing column

↓

Result: DataFrame with updated columns

↓

End

Start with a DataFrame, add new columns using expressions, rename columns, and get the updated DataFrame.

Execution Sample

Apache Spark

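from pyspark.sql import SparkSession

# Reuse the active SparkSession, or start one if none exists
spark = SparkSession.builder.getOrCreate()
# Two rows with columns 'id' and 'name'
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
# Add 'age' as id + 20, then rename 'name' to 'first_name'; df itself is not modified
df2 = df.withColumn("age", df.id + 20).withColumnRenamed("name", "first_name")
df2.show()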

Create a DataFrame, add a new column 'age' by adding 20 to 'id', rename 'name' to 'first_name', then show the result.

Execution Table

Step	Action	DataFrame Columns	Notes
1	Create DataFrame	[id, name]	Initial DataFrame with 2 columns
2	Add column 'age' = id + 20	[id, name, age]	New column 'age' added
3	Rename column 'name' to 'first_name'	[id, first_name, age]	Column 'name' renamed
4	Show DataFrame	[id, first_name, age]	Displays updated DataFrame
5	End	[id, first_name, age]	No more changes

💡 All columns have been added and renamed as requested, so execution ends.

Variable Tracker

Variable	Start	After Step 2	After Step 3	Final
df.columns	['id', 'name']	['id', 'name']	['id', 'name']	['id', 'name']
df2.columns	N/A	['id', 'name', 'age']	['id', 'first_name', 'age']	['id', 'first_name', 'age']

Key Moments - 2 Insights

Why does df.columns not change after adding or renaming columns?

What happens if you rename a column that does not exist?

Visual Quiz

Test your understanding

Look at the Execution Table. What columns does df2 have after step 2?

A. [id, name]

B. [id, first_name, age]

C. [id, name, age]

D. [id, age]

Concept Snapshot

Adding and renaming columns in Spark:
- Use withColumn('new_col', expr) to add or replace columns.
- Use withColumnRenamed('old', 'new') to rename columns.
- DataFrames are immutable; these return new DataFrames.
- Chain methods to add and rename in one statement.
- show() displays the updated DataFrame.

Full Transcript

We start with a Spark DataFrame with columns 'id' and 'name'. We add a new column 'age' by adding 20 to 'id' using withColumn. Then we rename the 'name' column to 'first_name' using withColumnRenamed. Each operation returns a new DataFrame, so the original stays unchanged. Finally, we show the updated DataFrame with columns 'id', 'first_name', and 'age'.