Apache Spark · Data · ~10 mins

Type casting and null handling in Apache Spark - Step-by-Step Execution

Concept Flow - Type casting and null handling
Start with DataFrame → Select column to cast → Apply cast to target type → Check for nulls → Handle nulls: fill/drop → Nulls handled → Resulting DataFrame with casted types and nulls handled → End
Start with a DataFrame, cast columns to new types, check for nulls, handle them by filling or dropping, then get the final cleaned DataFrame.
Execution Sample
Apache Spark
from pyspark.sql.functions import col

# 'spark' is the active SparkSession (created automatically in pyspark shells/notebooks)
df = spark.createDataFrame([(1, '10'), (2, None), (3, '30')], ['id', 'value'])
df2 = df.withColumn('value_int', col('value').cast('int'))  # None/unparseable -> null
df3 = df2.na.fill({'value_int': 0})                         # replace nulls with 0
Create a DataFrame, cast 'value' column to integer, then fill nulls in 'value_int' with 0.
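Running the sample requires a live SparkSession, but the two transformations can also be traced in plain Python. The sketch below uses our own helper names (not Spark APIs) to mimic, row by row, what cast('int') and na.fill do to this data:

```python
def cast_int(value):
    # Mimics Spark's cast('int') under default (non-ANSI) settings:
    # parseable strings become ints; None and bad strings become None (null).
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

rows = [{'id': 1, 'value': '10'}, {'id': 2, 'value': None}, {'id': 3, 'value': '30'}]

# Analogue of: df.withColumn('value_int', col('value').cast('int'))
rows2 = [{**r, 'value_int': cast_int(r['value'])} for r in rows]

# Analogue of: df2.na.fill({'value_int': 0})
rows3 = [{**r, 'value_int': 0 if r['value_int'] is None else r['value_int']}
         for r in rows2]

print([r['value_int'] for r in rows2])  # [10, None, 30]
print([r['value_int'] for r in rows3])  # [10, 0, 30]
```

The intermediate and final lists match the 'value_int' columns in the execution table below.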
Execution Table
| Step | Action | Column 'value' | Column 'value_int' | Null Handling | Output |
|---|---|---|---|---|---|
| 1 | Create DataFrame | ['10', None, '30'] | N/A | N/A | [{'id':1,'value':'10'}, {'id':2,'value':None}, {'id':3,'value':'30'}] |
| 2 | Cast 'value' to int | ['10', None, '30'] | [10, null, 30] | N/A | [{'id':1,'value':'10','value_int':10}, {'id':2,'value':None,'value_int':null}, {'id':3,'value':'30','value_int':30}] |
| 3 | Fill nulls in 'value_int' with 0 | ['10', None, '30'] | [10, 0, 30] | Null replaced with 0 | [{'id':1,'value':'10','value_int':10}, {'id':2,'value':None,'value_int':0}, {'id':3,'value':'30','value_int':30}] |
💡 All nulls in 'value_int' replaced, casting complete.
Variable Tracker
| Variable | Start | After Step 2 | After Step 3 |
|---|---|---|---|
| df | [{'id':1,'value':'10'}, {'id':2,'value':None}, {'id':3,'value':'30'}] | Same | Same |
| df2 | N/A | [{'id':1,'value':'10','value_int':10}, {'id':2,'value':None,'value_int':null}, {'id':3,'value':'30','value_int':30}] | Same |
| df3 | N/A | N/A | [{'id':1,'value':'10','value_int':10}, {'id':2,'value':None,'value_int':0}, {'id':3,'value':'30','value_int':30}] |
Key Moments - 2 Insights
Why does casting a string column with None values to int produce nulls instead of errors?
With Spark's default (non-ANSI) settings, cast converts valid numeric strings to integers and returns null for anything it cannot parse, rather than raising an error; a None input is already null and simply stays null. Step 2 of the execution table shows the row with 'value' = None getting null in 'value_int'.
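The same forgiving behaviour applies to strings that are not numeric at all, not just to None. A plain-Python approximation of the rule (assuming Spark's default non-ANSI mode, where cast returns null on any parse failure; the helper name is ours):

```python
def spark_like_cast_int(value):
    # Approximates cast('int') with spark.sql.ansi.enabled=false:
    # any value that cannot be parsed as an int maps to None (null).
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

print([spark_like_cast_int(v) for v in ['10', None, 'abc', '30']])
# [10, None, None, 30]
```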
How does filling nulls with a value like 0 affect the DataFrame?
Filling nulls replaces every null entry in the chosen column with the specified value, preventing null-related errors in later processing. Step 3 of the execution table shows the null in 'value_int' replaced by 0.
Visual Quiz - 3 Questions
Test your understanding
Looking at Step 2 of the execution table, what is the value of 'value_int' for the row where 'value' is None?
A. 10
B. null
C. 0
D. None
💡 Hint
Check the Column 'value_int' entry at Step 2 of the execution table for the row where 'value' is None.
At which step are nulls in 'value_int' replaced with 0?
A. Step 1
B. Step 2
C. Step 3
D. No step replaces nulls
💡 Hint
Look at the 'Null Handling' column in the execution table.
If we skip the null-filling step, what is the value of 'value_int' for the row where 'value' is None?
A. null
B. 0
C. None
D. Error
💡 Hint
Refer to Step 2 of the execution table, where casting has happened but nulls have not yet been filled.
Concept Snapshot
Type casting changes column data types (e.g., string to int).
Nulls appear when casting invalid or missing data.
Use na.fill() or na.drop() to handle nulls.
Filling nulls prevents errors in later processing.
Always check data after casting for nulls.
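The snapshot mentions na.drop() as the alternative to filling; in Spark the drop variant of this example would be df2.na.drop(subset=['value_int']). The plain-Python sketch below (not Spark code) contrasts the two choices on the example rows:

```python
rows2 = [{'id': 1, 'value_int': 10},
         {'id': 2, 'value_int': None},
         {'id': 3, 'value_int': 30}]

# na.fill({'value_int': 0}): keep every row, replace nulls with 0
filled = [{**r, 'value_int': 0 if r['value_int'] is None else r['value_int']}
          for r in rows2]

# na.drop(subset=['value_int']): discard rows where the column is null
dropped = [r for r in rows2 if r['value_int'] is not None]

print(len(filled), len(dropped))  # 3 2
```

Filling preserves row count at the cost of inventing a value; dropping keeps only genuine data at the cost of losing rows. Which is right depends on the downstream analysis.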
Full Transcript
We start with a DataFrame containing strings and None values. We cast the string column to integers. Valid strings become integers, but None becomes null. Then, we fill nulls with zero to avoid problems later. This process ensures the data is clean and ready for analysis.