
Null and duplicate detection in Apache Spark

Introduction

We check for missing or repeated data to keep our data clean and reliable.

When you want to find missing values in a customer list.
Before analyzing sales data, to remove repeated entries.
To check if survey responses have empty answers.
When preparing data for machine learning to avoid errors.
To ensure data quality before reporting results.
Syntax
Apache Spark
df.filter(df['column'].isNull())
df.dropDuplicates()

# To count nulls in a column:
df.filter(df['column'].isNull()).count()

# To find duplicate rows:
df.groupBy(df.columns).count().filter('count > 1')

isNull() returns a Boolean column that is True wherever the value is missing, so it can be used directly inside filter().

dropDuplicates() returns a new DataFrame with repeated rows removed; the original DataFrame is left unchanged.

Examples
Shows rows where the 'age' column has missing values.
Apache Spark
df.filter(df['age'].isNull()).show()
Removes duplicate rows and shows the unique rows.
Apache Spark
df.dropDuplicates().show()
Finds and shows rows that appear more than once (duplicates).
Apache Spark
df.groupBy(df.columns).count().filter('count > 1').show()
Sample Program

This program creates a small table with some missing ages and repeated rows. It shows how to find missing ages, detect duplicates, and remove duplicates.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('NullDuplicateDemo').getOrCreate()

# Sample data with nulls and duplicates
data = [
    (1, 'Alice', 25),
    (2, 'Bob', None),
    (3, 'Charlie', 30),
    (4, 'Bob', None),
    (5, None, 22),
    (1, 'Alice', 25)
]

columns = ['id', 'name', 'age']

df = spark.createDataFrame(data, columns)

print('Original DataFrame:')
df.show()

print('Rows with nulls in age column:')
df.filter(df['age'].isNull()).show()

print('Duplicate rows:')
df.groupBy(df.columns).count().filter('count > 1').show()

print('DataFrame after dropping duplicates:')
df.dropDuplicates().show()

spark.stop()
Important Notes

Null values can cause errors in calculations if not handled.

Duplicates can bias your analysis by counting the same data multiple times.

Always check for nulls and duplicates before deeper analysis.

Summary

Null detection finds missing data in columns.

Duplicate detection finds repeated rows.

Cleaning data improves analysis accuracy.