
Type casting and null handling in Apache Spark

Introduction

Type casting changes data from one type to another. Null handling manages missing or empty data safely.

When you want to convert a column from string to number to do math.
When you need to replace missing values to avoid errors in calculations.
When reading data from files that may have empty or null fields.
When preparing data for machine learning models that require specific types.
When cleaning data to ensure consistent types across columns.
Syntax
Apache Spark
from pyspark.sql.functions import col

df = df.withColumn('new_column', col('old_column').cast('new_type'))

# To handle nulls, use functions like fillna or when

df = df.fillna({'column_name': default_value})

Use cast() to convert column types.

Use fillna() to replace nulls with default values.

Examples
Convert the 'age' column from string to integer.
Apache Spark
df = df.withColumn('age_int', col('age').cast('int'))
Replace nulls in 'age_int' column with 0.
Apache Spark
df = df.fillna({'age_int': 0})
Cast 'age' to int and replace nulls with 0 in one step.
Apache Spark
from pyspark.sql.functions import when

df = df.withColumn('age_clean', when(col('age').isNull(), 0).otherwise(col('age').cast('int')))
Sample Program

This program creates a DataFrame whose ages are stored as strings, with some nulls and a literal 'NaN' string. It replaces 'NaN' with null, then casts age to integer and replaces nulls with 0.

Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName('TypeCastingNullHandling').getOrCreate()

# Sample data with string ages and some nulls
data = [
    ('Alice', '23'),
    ('Bob', None),
    ('Charlie', '35'),
    ('David', 'NaN'),
    ('Eve', None)
]

columns = ['name', 'age']

df = spark.createDataFrame(data, columns)

# Show original data
print('Original Data:')
df.show()

# Replace the 'NaN' string with null ('when' is already imported above)
df = df.withColumn('age', when(col('age') == 'NaN', None).otherwise(col('age')))

# Cast age to integer and replace nulls with 0
df = df.withColumn('age_int', when(col('age').isNull(), 0).otherwise(col('age').cast('int')))

print('After type casting and null handling:')
df.show()

spark.stop()
Important Notes

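Null values propagate through most expressions (for example, null + 1 is null), and aggregates such as avg() skip them, so unhandled nulls can silently distort results.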

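Check column types (df.printSchema() or df.dtypes) before casting. With ANSI mode off (spark.sql.ansi.enabled set to false), a value that cannot be converted becomes null without any error; with ANSI mode on, the cast raises an error instead.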

Use when() for flexible null handling and conditional logic.

Summary

Type casting changes data types to fit your analysis needs.

Null handling replaces or manages missing data safely.

Use Spark functions like cast(), fillna(), and when() for these tasks.