
Null and duplicate detection in Apache Spark

Introduction

We check for missing or repeated data to keep our data clean and reliable.

When you want to find missing values in a customer list.
Before analyzing sales data, to remove repeated entries.
To check if survey responses have empty answers.
When preparing data for machine learning to avoid errors.
To ensure data quality before reporting results.
Syntax
Apache Spark
df.filter(df['column'].isNull())
df.dropDuplicates()

# To count nulls in a column:
df.filter(df['column'].isNull()).count()

# To find duplicate rows:
df.groupBy(df.columns).count().filter('count > 1')

isNull() returns a Boolean column that is True wherever the value is missing, so it can be used directly inside filter().

dropDuplicates() returns a new DataFrame with repeated rows removed; the original DataFrame is left unchanged.

Examples
Shows rows where the 'age' column has missing values.
Apache Spark
df.filter(df['age'].isNull()).show()
Removes duplicate rows and shows the unique rows.
Apache Spark
df.dropDuplicates().show()
Finds and shows rows that appear more than once (duplicates).
Apache Spark
df.groupBy(df.columns).count().filter('count > 1').show()
Sample Program

This program creates a small table with some missing ages and repeated rows. It shows how to find missing ages, detect duplicates, and remove duplicates.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('NullDuplicateDemo').getOrCreate()

# Sample data with nulls and duplicates
data = [
    (1, 'Alice', 25),
    (2, 'Bob', None),
    (3, 'Charlie', 30),
    (4, 'Bob', None),
    (5, None, 22),
    (1, 'Alice', 25)
]

columns = ['id', 'name', 'age']

df = spark.createDataFrame(data, columns)

print('Original DataFrame:')
df.show()

print('Rows with nulls in age column:')
df.filter(df['age'].isNull()).show()

print('Duplicate rows:')
df.groupBy(df.columns).count().filter('count > 1').show()

print('DataFrame after dropping duplicates:')
df.dropDuplicates().show()

spark.stop()
Important Notes

Null values can cause errors in calculations if not handled.

Duplicates can bias your analysis by counting the same data multiple times.

Always check for nulls and duplicates before deeper analysis.

Summary

Null detection finds missing data in columns.

Duplicate detection finds repeated rows.

Cleaning data improves analysis accuracy.