Apache Spark · How-To · Beginner · 3 min read

How to Cast Column Type in PySpark: Simple Guide

In PySpark, you can change a column's data type using the cast() method on a column object. For example, use df.withColumn('col_name', df['col_name'].cast('new_type')) to cast a column to a new type, where the target can be given as a type string like 'int' or as a type object like IntegerType().

Syntax

The basic syntax to cast a column type in PySpark is:

  • df.withColumn('column_name', df['column_name'].cast('target_type'))

Here, df is your DataFrame, withColumn creates a new column or replaces an existing one, cast() changes the data type, and target_type is the desired data type, given either as a type string (like 'int') or as a PySpark type object (like IntegerType()).

python
df = df.withColumn('column_name', df['column_name'].cast('target_type'))

Example

This example shows how to cast a string column to integer type in a PySpark DataFrame.

python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName('CastExample').getOrCreate()
data = [('1',), ('2',), ('3',)]
columns = ['number_str']
df = spark.createDataFrame(data, columns)

# Cast 'number_str' from string to integer
new_df = df.withColumn('number_int', df['number_str'].cast(IntegerType()))

new_df.show()
Output
+----------+----------+
|number_str|number_int|
+----------+----------+
|         1|         1|
|         2|         2|
|         3|         3|
+----------+----------+

Common Pitfalls

Common mistakes when casting columns include:

  • Using an invalid type string or type object in cast().
  • Trying to cast non-convertible values (like letters to integers), which results in null values.
  • Not assigning the result of withColumn back to a DataFrame, so changes are lost.

Always check your data before casting and handle nulls if needed.

python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName('CastPitfall').getOrCreate()
data = [('a',), ('2',), ('3',)]
columns = ['number_str']
df = spark.createDataFrame(data, columns)

# Wrong: casting without assignment (no change)
df.withColumn('number_int', df['number_str'].cast(IntegerType()))
df.show()

# Right: assign the casted DataFrame
new_df = df.withColumn('number_int', df['number_str'].cast(IntegerType()))
new_df.show()
Output
+----------+
|number_str|
+----------+
|         a|
|         2|
|         3|
+----------+

+----------+----------+
|number_str|number_int|
+----------+----------+
|         a|      null|
|         2|         2|
|         3|         3|
+----------+----------+

Quick Reference

Common data types you can cast to in PySpark include:

Data Type                      Description
StringType or 'string'         Text data
IntegerType or 'int'           Whole numbers
FloatType or 'float'           Single-precision decimals
DoubleType or 'double'         Double-precision decimals
BooleanType or 'boolean'       True/False values
TimestampType or 'timestamp'   Date and time values

Key Takeaways

  • Use df.withColumn() with cast() to change a column's data type in PySpark.
  • Always assign the result of withColumn() back to a DataFrame to keep changes.
  • Casting invalid values results in nulls, so check your data before casting.
  • You can cast using either type strings like 'int' or PySpark type objects like IntegerType().
  • Common target types include string, int, float, boolean, and timestamp.