
Null and duplicate detection in Apache Spark - Mini Project: Build & Apply

Null and Duplicate Detection
📖 Scenario: You work as a data analyst for an online store. You receive a dataset of customer orders. Some orders have missing information or are repeated by mistake. You want to find these problems before analyzing the data.
🎯 Goal: You will create a Spark DataFrame with order data, set a threshold for missing values, find rows with nulls, detect duplicate rows, and print the results.
📋 What You'll Learn
Create a Spark DataFrame with given order data
Create a variable for the threshold of allowed null values
Use Spark functions to find rows with null values
Use Spark functions to find duplicate rows
Print the rows with nulls and duplicates
💡 Why This Matters
🌍 Real World
Detecting missing and duplicated records before analysis or modeling keeps your datasets clean and reliable.
💼 Career
Data scientists and analysts often clean data by finding and handling nulls and duplicates to improve data quality.
Step 1: Create the Spark DataFrame
Create a Spark DataFrame called orders with these exact rows and columns: order_id, customer_id, product, quantity. Use these rows: (1, 101, 'Book', 2), (2, 102, 'Pen', null), (3, 103, 'Notebook', 1), (4, 101, 'Book', 2), (5, null, 'Pencil', 3).
Hint: Use spark.createDataFrame with an explicit schema. In Python, write each null value as None.

Step 2: Set the null threshold
Create an integer variable called null_threshold and set it to 1. This will be the maximum allowed number of null values per row.
Hint: Create a variable named null_threshold and assign it the value 1.

Step 3: Find rows with null values and duplicates
Create a DataFrame called null_rows that contains the rows from orders whose number of null values exceeds null_threshold. Then create a DataFrame called duplicate_rows containing the orders that are duplicates, that is, the same order details appear more than once (compare customer_id, product, and quantity, since order_id is unique for every row). Use Spark functions for both.
Hint: Use isNull() and cast('int') on each column, then sum the results to count nulls per row. Use groupBy and count() to find duplicates.

Step 4: Print the rows with nulls and duplicates
Display the null_rows and duplicate_rows DataFrames using show().
Hint: Use print() to label each result and show() to display the DataFrames.