Apache Spark · ~30 mins

Type casting and null handling in Apache Spark - Mini Project: Build & Apply

📖 Scenario: You work as a data analyst for a retail company. You receive sales data where some numbers are stored as strings and some values are missing (null). You need to clean this data by converting the strings to numbers and handling the missing values properly.
🎯 Goal: Build a Spark DataFrame with sales data, convert the sales amount from string to integer, replace null sales with zero, and display the cleaned data.
📋 What You'll Learn
Create a Spark DataFrame with specific sales data including null values
Create a variable for the replacement value for nulls
Cast the sales column from string to integer and replace nulls with zero
Print the final cleaned DataFrame
💡 Why This Matters
🌍 Real World
Data often comes with missing values and wrong data types. Cleaning data by converting types and handling nulls is a key step before analysis.
💼 Career
Data scientists and analysts frequently clean and prepare data using Spark for big data projects, ensuring accurate and reliable results.
1
Create the initial Spark DataFrame
Create a Spark DataFrame called sales_df with the following data: [('Store1', '100'), ('Store2', None), ('Store3', '250'), ('Store4', None)]. The columns should be named 'store' and 'sales'; in Python, None represents null. Use spark.createDataFrame() and Row from pyspark.sql.
Apache Spark
Need a hint?

Use Row to create each row with store and sales fields. Use None for null values.

2
Set the replacement value for null sales
Create a variable called replacement_value and set it to 0. This will be used to replace null sales values.
Need a hint?

Just create a variable named replacement_value and assign 0 to it.

3
Cast sales to integer and replace nulls
Create a new DataFrame called cleaned_df by casting the sales column of sales_df from string to integer using cast('int'). Then replace null values in sales with replacement_value using fillna().
Need a hint?

Use withColumn with col('sales').cast('int') to convert types. Then use fillna({'sales': replacement_value}) to replace nulls.

4
Display the cleaned DataFrame
Print the contents of cleaned_df using the show() method to display the cleaned sales data.
Need a hint?

Use cleaned_df.show() to display the DataFrame.