How to Use withColumn in PySpark: Simple Guide
In PySpark, withColumn adds a new column or replaces an existing one in a DataFrame, given a column name and an expression. It returns a new DataFrame with the updated column and leaves the original data unchanged.
Syntax
The withColumn method takes two arguments: the name of the column to add or update, and the expression or value to assign to that column. It returns a new DataFrame with the change applied.
- columnName: String name of the new or existing column.
- col: Expression or value to assign to the column, often using PySpark functions.
```python
DataFrame.withColumn(colName: str, col: Column) -> DataFrame
```
Example
This example shows how to add a new column called age_plus_one by adding 1 to the existing age column in a PySpark DataFrame.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master('local').appName('Example').getOrCreate()

data = [(1, 'Alice', 20), (2, 'Bob', 30)]
columns = ['id', 'name', 'age']
df = spark.createDataFrame(data, columns)

# Add a new column 'age_plus_one' by adding 1 to 'age'
df2 = df.withColumn('age_plus_one', col('age') + 1)
df2.show()
```
Output
+---+-----+---+------------+
| id| name|age|age_plus_one|
+---+-----+---+------------+
| 1|Alice| 20| 21|
| 2| Bob| 30| 31|
+---+-----+---+------------+
Common Pitfalls
Common mistakes when using withColumn include:
- Trying to modify the DataFrame in place instead of assigning the result to a new variable or the same variable.
- Using plain Python operations (such as `if`/`else` or built-in functions) on columns instead of PySpark column expressions; a Column object cannot be evaluated by Python, so this raises errors.
- Overwriting important columns unintentionally without checking.
Always remember that withColumn returns a new DataFrame and does not change the original.
```python
from pyspark.sql.functions import lit

# Wrong: the result is discarded, so df is unchanged
# df.withColumn('new_col', lit(100))

# Right: assign the result back to df or a new variable
df = df.withColumn('new_col', lit(100))
```
Quick Reference
| Parameter | Description |
|---|---|
| colName | Name of the column to add or update |
| col | Expression or value to assign to the column |
| Returns | New DataFrame with the added or updated column |
Key Takeaways
- Use withColumn to add or update columns in a PySpark DataFrame.
- withColumn returns a new DataFrame; always assign it to a variable.
- Use PySpark column expressions, not plain Python operations, inside withColumn.
- Check column names to avoid overwriting important data unintentionally.
- withColumn is useful for simple transformations and creating new features.