Apache Spark · Data · ~10 mins

Type casting and null handling in Apache Spark - Step-by-Step Execution

Concept Flow - Type casting and null handling
Start with DataFrame → Select column to cast → Apply cast to target type → Check for nulls → Handle nulls: fill/drop → Nulls handled → Resulting DataFrame with casted types and nulls handled → End
Start with a DataFrame, cast columns to new types, check for nulls, handle them by filling or dropping, then get the final cleaned DataFrame.
Execution Sample
Apache Spark
from pyspark.sql.functions import col

# 'spark' is the active SparkSession (created automatically in pyspark shells/notebooks)
df = spark.createDataFrame([(1, '10'), (2, None), (3, '30')], ['id', 'value'])
df2 = df.withColumn('value_int', col('value').cast('int'))  # None/unparseable -> null
df3 = df2.na.fill({'value_int': 0})                         # replace nulls with 0
Create a DataFrame, cast 'value' column to integer, then fill nulls in 'value_int' with 0.
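Running the sample requires a live SparkSession, but the two transformations can also be traced in plain Python. The sketch below uses our own helper names (not Spark APIs) to mimic, row by row, what cast('int') and na.fill do to this data:

```python
def cast_int(value):
    # Mimics Spark's cast('int') under default (non-ANSI) settings:
    # parseable strings become ints; None and bad strings become None (null).
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

rows = [{'id': 1, 'value': '10'}, {'id': 2, 'value': None}, {'id': 3, 'value': '30'}]

# Analogue of: df.withColumn('value_int', col('value').cast('int'))
rows2 = [{**r, 'value_int': cast_int(r['value'])} for r in rows]

# Analogue of: df2.na.fill({'value_int': 0})
rows3 = [{**r, 'value_int': 0 if r['value_int'] is None else r['value_int']}
         for r in rows2]

print([r['value_int'] for r in rows2])  # [10, None, 30]
print([r['value_int'] for r in rows3])  # [10, 0, 30]
```

The intermediate and final lists match the 'value_int' columns in the execution table below.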
Execution Table
| Step | Action | Column 'value' | Column 'value_int' | Null Handling | Output |
|---|---|---|---|---|---|
| 1 | Create DataFrame | ['10', None, '30'] | N/A | N/A | [{'id':1,'value':'10'}, {'id':2,'value':None}, {'id':3,'value':'30'}] |
| 2 | Cast 'value' to int | ['10', None, '30'] | [10, null, 30] | N/A | [{'id':1,'value':'10','value_int':10}, {'id':2,'value':None,'value_int':null}, {'id':3,'value':'30','value_int':30}] |
| 3 | Fill nulls in 'value_int' with 0 | ['10', None, '30'] | [10, 0, 30] | Null replaced with 0 | [{'id':1,'value':'10','value_int':10}, {'id':2,'value':None,'value_int':0}, {'id':3,'value':'30','value_int':30}] |
💡 All nulls in 'value_int' replaced, casting complete.
Variable Tracker
| Variable | Start | After Step 2 | After Step 3 |
|---|---|---|---|
| df | [{'id':1,'value':'10'}, {'id':2,'value':None}, {'id':3,'value':'30'}] | Same | Same |
| df2 | N/A | [{'id':1,'value':'10','value_int':10}, {'id':2,'value':None,'value_int':null}, {'id':3,'value':'30','value_int':30}] | Same |
| df3 | N/A | N/A | [{'id':1,'value':'10','value_int':10}, {'id':2,'value':None,'value_int':0}, {'id':3,'value':'30','value_int':30}] |
Key Moments - 2 Insights
Why does casting a string column with None values to int produce nulls instead of errors?
With Spark's default (non-ANSI) settings, cast converts valid numeric strings to integers and returns null for anything it cannot parse, rather than raising an error; a None input is already null and simply stays null. Step 2 of the execution table shows the row with 'value' = None getting null in 'value_int'.
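The same forgiving behaviour applies to strings that are not numeric at all, not just to None. A plain-Python approximation of the rule (assuming Spark's default non-ANSI mode, where cast returns null on any parse failure; the helper name is ours):

```python
def spark_like_cast_int(value):
    # Approximates cast('int') with spark.sql.ansi.enabled=false:
    # any value that cannot be parsed as an int maps to None (null).
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

print([spark_like_cast_int(v) for v in ['10', None, 'abc', '30']])
# [10, None, None, 30]
```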
How does filling nulls with a value like 0 affect the DataFrame?
Filling nulls replaces every null entry in the chosen column with the specified value, preventing null-related errors in later processing. Step 3 of the execution table shows the null in 'value_int' replaced by 0.
Visual Quiz - 3 Questions
Test your understanding
Looking at Step 2 of the execution table, what is the value of 'value_int' for the row where 'value' is None?
A. 10
B. null
C. 0
D. None
💡 Hint
Check the Column 'value_int' entry at Step 2 of the execution table for the row where 'value' is None.
At which step are nulls in 'value_int' replaced with 0?
A. Step 1
B. Step 2
C. Step 3
D. No step replaces nulls
💡 Hint
Look at the 'Null Handling' column in the execution table.
If we skip the null-filling step, what is the value of 'value_int' for the row where 'value' is None?
A. null
B. 0
C. None
D. Error
💡 Hint
Refer to Step 2 of the execution table, where casting has happened but nulls have not yet been filled.
Concept Snapshot
Type casting changes column data types (e.g., string to int).
Nulls appear when casting invalid or missing data.
Use na.fill() or na.drop() to handle nulls.
Filling nulls prevents errors in later processing.
Always check data after casting for nulls.
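The snapshot mentions na.drop() as the alternative to filling; in Spark the drop variant of this example would be df2.na.drop(subset=['value_int']). The plain-Python sketch below (not Spark code) contrasts the two choices on the example rows:

```python
rows2 = [{'id': 1, 'value_int': 10},
         {'id': 2, 'value_int': None},
         {'id': 3, 'value_int': 30}]

# na.fill({'value_int': 0}): keep every row, replace nulls with 0
filled = [{**r, 'value_int': 0 if r['value_int'] is None else r['value_int']}
          for r in rows2]

# na.drop(subset=['value_int']): discard rows where the column is null
dropped = [r for r in rows2 if r['value_int'] is not None]

print(len(filled), len(dropped))  # 3 2
```

Filling preserves row count at the cost of inventing a value; dropping keeps only genuine data at the cost of losing rows. Which is right depends on the downstream analysis.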
Full Transcript
We start with a DataFrame containing strings and None values. We cast the string column to integers. Valid strings become integers, but None becomes null. Then, we fill nulls with zero to avoid problems later. This process ensures the data is clean and ready for analysis.