Adding and renaming columns in Apache Spark - Step-by-Step Execution

Concept Flow - Adding and renaming columns
1. Start with a DataFrame
2. Add a new column with a value or expression
3. Rename an existing column
4. Result: DataFrame with updated columns
5. End
Start with a DataFrame, add new columns using expressions, rename columns, and get the updated DataFrame.
Execution Sample
Apache Spark
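from pyspark.sql import SparkSession
# Reuse the active Spark session, or start one if none exists
spark = SparkSession.builder.getOrCreate()
# Build a two-row DataFrame with columns 'id' and 'name'
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
# Add 'age' as id + 20, then rename 'name' to 'first_name'; df itself is unchanged
df2 = df.withColumn("age", df.id + 20).withColumnRenamed("name", "first_name")
# Print the resulting rows to the console
df2.show()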
Create a DataFrame, add a new column 'age' by adding 20 to 'id', rename 'name' to 'first_name', then show the result.
Execution Table
| Step | Action | DataFrame Columns | Notes |
|------|--------|-------------------|-------|
| 1 | Create DataFrame | [id, name] | Initial DataFrame with 2 columns |
| 2 | Add column 'age' = id + 20 | [id, name, age] | New column 'age' added |
| 3 | Rename column 'name' to 'first_name' | [id, first_name, age] | Column 'name' renamed |
| 4 | Show DataFrame | [id, first_name, age] | Displays updated DataFrame |
| 5 | End | [id, first_name, age] | No more changes |
💡 All columns have been added and renamed as requested, so execution ends.
Variable Tracker
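| Variable | Start | After Step 2 | After Step 3 | Final |
|----------|-------|--------------|--------------|-------|
| df.columns | ['id', 'name'] | ['id', 'name'] | ['id', 'name'] | ['id', 'name'] |
| df2.columns | N/A | ['id', 'name', 'age'] | ['id', 'first_name', 'age'] | ['id', 'first_name', 'age'] |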
Key Moments - 2 Insights
Why does df.columns not change after adding or renaming columns?
Because Spark DataFrames are immutable. Adding or renaming a column returns a new DataFrame (here df2) and leaves df unchanged, as shown in steps 2 and 3 of the Execution Table.
What happens if you rename a column that does not exist?
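Nothing happens. withColumnRenamed is a no-op when the schema does not contain the given column name: it returns a DataFrame with the same columns and raises no error. The Execution Table assumes the column exists, so check the name first if a silent no-op would hide a mistake.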
Visual Quiz - 3 Questions
Test your understanding
According to the Execution Table, which columns does df2 have after step 2?
A. [id, name]
B. [id, first_name, age]
C. [id, name, age]
D. [id, age]
💡 Hint
Check the 'DataFrame Columns' column for step 2 in the Execution Table.
At which step is the column 'name' renamed?
A. Step 3
B. Step 2
C. Step 1
D. Step 4
💡 Hint
Look for the action that mentions renaming in the Execution Table.
If you add the column 'age' as df.id + 10 instead of df.id + 20, how would df2.columns after step 2 change?
A. Columns would be [id, first_name, age]
B. Columns would be [id, name, age], but the age values would differ
C. Columns would be [id, name]
D. Columns would be [id, name, age, age2]
💡 Hint
Adding a column changes the list of columns, but the expression only affects the values, not the column names.
Concept Snapshot
Adding and renaming columns in Spark:
- Use withColumn('new_col', expr) to add or replace columns.
- Use withColumnRenamed('old', 'new') to rename columns.
- DataFrames are immutable; these return new DataFrames.
- Chain the methods to add and rename in one statement.
- show() displays the updated DataFrame.
Full Transcript
We start with a Spark DataFrame with columns 'id' and 'name'. We add a new column 'age' by adding 20 to 'id' using withColumn. Then we rename the 'name' column to 'first_name' using withColumnRenamed. Each operation returns a new DataFrame, so the original stays unchanged. Finally, we show the updated DataFrame with columns 'id', 'first_name', and 'age'.