
How to Rename a Column in a PySpark DataFrame

In PySpark, you can rename a column using the withColumnRenamed method on a DataFrame. This method takes the current column name and the new column name as arguments and returns a new DataFrame with the renamed column.

Syntax

Call withColumnRenamed on a DataFrame, passing the existing column name followed by the new one.

  • df.withColumnRenamed(existing, new)
  • existing: The current name of the column you want to rename.
  • new: The new name you want to assign to the column.
  • This method returns a new DataFrame with the column renamed; it does not change the original DataFrame.
python
new_df = df.withColumnRenamed("old_column_name", "new_column_name")

Example

This example shows how to rename a column named age to years in a PySpark DataFrame.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RenameColumnExample").getOrCreate()

# Create sample data
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Cathy", 22)]
columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)

# Rename column 'age' to 'years'
new_df = df.withColumnRenamed("age", "years")

# Show original and new DataFrames
print("Original DataFrame:")
df.show()

print("DataFrame after renaming column:")
new_df.show()
Output
Original DataFrame:
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 25|
|  2|  Bob| 30|
|  3|Cathy| 22|
+---+-----+---+

DataFrame after renaming column:
+---+-----+-----+
| id| name|years|
+---+-----+-----+
|  1|Alice|   25|
|  2|  Bob|   30|
|  3|Cathy|   22|
+---+-----+-----+

Common Pitfalls

  • Trying to rename a column that does not exist will not raise an error but will leave the DataFrame unchanged.
  • Remember that withColumnRenamed returns a new DataFrame; it does not modify the original one.
  • Renaming multiple columns requires chaining withColumnRenamed calls or using other methods.
python
# No error is raised: wrong_df has exactly the same columns as df
wrong_df = df.withColumnRenamed("non_existing_column", "new_name")

# Rename multiple columns by chaining calls
renamed_df = df.withColumnRenamed("id", "user_id").withColumnRenamed("age", "years")
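
Because a misspelled column name is silently ignored, it can help to validate the names before renaming. The following is a minimal sketch of one way to do that. rename_columns is a hypothetical helper, not part of PySpark, and it assumes df is a PySpark DataFrame (it relies only on df.columns and df.withColumnRenamed).

```python
from functools import reduce


def rename_columns(df, mapping):
    """Rename several columns at once, failing loudly on unknown names.

    Hypothetical helper (not part of PySpark). `df` is assumed to be a
    PySpark DataFrame; `mapping` is a dict of {existing_name: new_name}.
    """
    # withColumnRenamed silently ignores unknown names, so check them first.
    # Note: this comparison is case-sensitive, which may be stricter than
    # Spark's own column name resolution.
    missing = [name for name in mapping if name not in df.columns]
    if missing:
        raise ValueError(f"Columns not found in DataFrame: {missing}")

    # Chain one withColumnRenamed call per mapping entry.
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        mapping.items(),
        df,
    )
```

With the example DataFrame, rename_columns(df, {"id": "user_id", "age": "years"}) should return a new DataFrame with the columns user_id, name, and years, while rename_columns(df, {"agee": "years"}) raises a ValueError instead of quietly doing nothing.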

Quick Reference

Method            | Description                                   | Example
------------------|-----------------------------------------------|-------------------------------------
withColumnRenamed | Rename one column                             | df.withColumnRenamed('old', 'new')
selectExpr        | Rename multiple columns with SQL expressions  | df.selectExpr('old AS new', 'col2')
toDF              | Rename all columns by providing new names     | df.toDF('new1', 'new2', 'new3')
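
Both selectExpr and toDF need an entry for every column, including the ones that keep their names. As a rough sketch, the arguments can be generated from a rename mapping. The helper names renamed_names and rename_exprs are hypothetical, and the plain list below stands in for df.columns.

```python
def renamed_names(columns, mapping):
    """Return the column names after applying `mapping`, in the original order."""
    return [mapping.get(name, name) for name in columns]


def rename_exprs(columns, mapping):
    """Return one SQL expression per column for use with selectExpr."""
    # Backticks quote the names so spaces or special characters are handled.
    # (Names that themselves contain a backtick are not covered by this sketch.)
    return [f"`{name}` AS `{mapping.get(name, name)}`" for name in columns]


columns = ["id", "name", "age"]              # stands in for df.columns
mapping = {"id": "user_id", "age": "years"}  # existing name -> new name

print(renamed_names(columns, mapping))  # ['user_id', 'name', 'years']
print(rename_exprs(columns, mapping))

# With a real DataFrame `df`, either of these would apply the rename:
#   df.toDF(*renamed_names(df.columns, mapping))
#   df.selectExpr(*rename_exprs(df.columns, mapping))
```

Columns missing from the mapping keep their current names, so the column order and count stay the same.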

Key Takeaways

  • Use withColumnRenamed to rename a single column in a PySpark DataFrame.
  • withColumnRenamed returns a new DataFrame; the original DataFrame stays unchanged.
  • Renaming a column that does not exist raises no error and leaves the DataFrame unchanged.
  • To rename several columns, chain withColumnRenamed calls, or use selectExpr or toDF to rename them in a single call.