
Null and duplicate detection in Apache Spark - Mini Project: Build & Apply

Null and Duplicate Detection
📖 Scenario: You work as a data analyst for an online store. You receive a dataset of customer orders. Some orders have missing information or are repeated by mistake. You want to find these problems before analyzing the data.
🎯 Goal: You will create a Spark DataFrame with order data, set a threshold for missing values, find rows with nulls, detect duplicate rows, and print the results.
📋 What You'll Learn
Create a Spark DataFrame with given order data
Create a variable for the threshold of allowed null values
Use Spark functions to find rows with null values
Use Spark functions to find duplicate rows
Print the rows with nulls and duplicates
💡 Why This Matters
🌍 Real World
Detecting missing and duplicated records before analysis or modeling keeps your datasets clean and reliable.
💼 Career
Data scientists and analysts often clean data by finding and handling nulls and duplicates to improve data quality.
Step 1: Create the Spark DataFrame
Create a Spark DataFrame called orders with these exact rows and columns: order_id, customer_id, product, quantity. Use these rows: (1, 101, 'Book', 2), (2, 102, 'Pen', null), (3, 103, 'Notebook', 1), (4, 101, 'Book', 2), (5, null, 'Pencil', 3).
Hint: Use spark.createDataFrame with an explicit schema. In Python, write each null value as None.

Step 2: Set the null threshold
Create an integer variable called null_threshold and set it to 1. This will be the maximum allowed number of null values per row.
Hint: Create a variable named null_threshold and assign it the value 1.

Step 3: Find rows with null values and duplicates
Create a DataFrame called null_rows that contains the rows from orders whose number of null values exceeds null_threshold. Then create a DataFrame called duplicate_rows containing the orders that are duplicates, that is, the same order details appear more than once (compare customer_id, product, and quantity, since order_id is unique for every row). Use Spark functions for both.
Hint: Use isNull() and cast('int') on each column, then sum the results to count nulls per row. Use groupBy and count() to find duplicates.

Step 4: Print the rows with nulls and duplicates
Display the null_rows and duplicate_rows DataFrames using show().
Hint: Use print() to label each result and show() to display the DataFrames.