Apache Spark · ~10 mins

Null and duplicate detection in Apache Spark - Step-by-Step Execution

Concept Flow - Null and duplicate detection
Start with DataFrame
Check for Nulls
Count Nulls per Column
Check for Duplicates
Count Duplicate Rows
Output Null and Duplicate Summary
End
We start with a DataFrame, check each column for null values, count them, then check for duplicate rows and count those, finally outputting a summary.
Execution Sample
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, count, lit

spark = SparkSession.builder.getOrCreate()

# Sample DataFrame: one null in 'value'; rows 1 and 3 are exact duplicates.
df = spark.createDataFrame(
    [(1, 'A'), (2, None), (1, 'A')], ['id', 'value'])

# Per-column null counts: when() emits 1 for null cells (and null otherwise),
# and count() counts only the non-null results.
null_counts = df.select(
    [count(when(col(c).isNull(), lit(1))).alias(c) for c in df.columns]
).collect()[0].asDict()
# null_counts == {'id': 0, 'value': 1}

# Duplicate rows: total row count minus distinct row count.
dup_count = df.count() - df.dropDuplicates().count()
# dup_count == 1
This code creates a DataFrame, counts nulls per column by counting the rows where isNull() is true, and counts duplicate rows as the difference between the total and deduplicated row counts.
Execution Table
| Step | Action | DataFrame State | Null Counts | Duplicate Count |
|------|--------|-----------------|-------------|-----------------|
| 1 | Create DataFrame with 3 rows | [(1, 'A'), (2, None), (1, 'A')] | N/A | N/A |
| 2 | Count nulls in 'id' | Same | id: 0 | N/A |
| 3 | Count nulls in 'value' | Same | value: 1 | N/A |
| 4 | Count total nulls per column | Same | {'id': 0, 'value': 1} | N/A |
| 5 | Count total rows | 3 rows | Same | N/A |
| 6 | Drop duplicates | [(1, 'A'), (2, None)] | Same | N/A |
| 7 | Count rows after dropDuplicates | 2 rows | Same | N/A |
| 8 | Calculate duplicates: 3 - 2 | Same | Same | 1 |
| 9 | Output null counts and duplicate count | Same | {'id': 0, 'value': 1} | 1 |
| 10 | End of detection | Same | Same | Same |
💡 All rows processed; nulls and duplicates counted.
Variable Tracker
| Variable | Start | After Step 2 | After Step 4 | After Step 7 | After Step 8 | Final |
|----------|-------|--------------|--------------|--------------|--------------|-------|
| df | Empty | 3 rows with data | Same | 2 rows after dropDuplicates | Same | Same |
| null_counts | N/A | Partial counts | {'id': 0, 'value': 1} | Same | Same | {'id': 0, 'value': 1} |
| dup_count | N/A | N/A | N/A | N/A | 1 | 1 |
Key Moments - 3 Insights
Why does the null count for 'value' show 1 but 'id' shows 0?
Because only the 'value' column contains a None (null) value, while 'id' has no nulls, as shown in execution table rows 2-4.
How is the duplicate count calculated?
Duplicate count is total rows minus the rows remaining after dropping duplicates, as shown in execution table rows 5-8.
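The total-minus-distinct arithmetic does not depend on Spark itself; as a hedged illustration, here is a Spark-free sketch of the same calculation on plain Python tuples (using the sample rows from the example above):

```python
# Spark-free sketch of the duplicate-count logic: total rows minus distinct rows.
rows = [(1, 'A'), (2, None), (1, 'A')]  # same sample data as the Spark example

total = len(rows)
unique = len(set(rows))   # set() keeps one copy of each exact-duplicate row
dup_count = total - unique

print(dup_count)  # 1: the row (1, 'A') appears twice, so one duplicate
```

This mirrors `df.count() - df.dropDuplicates().count()`: `len(rows)` plays the role of the total count and `len(set(rows))` the distinct count.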
Does dropDuplicates remove all duplicates or just some?
dropDuplicates removes every exact duplicate row, keeping one copy of each distinct row, as seen in execution table row 6.
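To see which rows survive, here is a hedged, Spark-free sketch of the keep-one-copy behavior (dropDuplicates keeps a single copy of each distinct row; the first-occurrence ordering below is just for illustration, since Spark does not guarantee which copy is kept):

```python
# Spark-free sketch of dropDuplicates(): keep one copy of each distinct row.
rows = [(1, 'A'), (2, None), (1, 'A')]  # same sample data as above

seen = set()
deduped = []
for row in rows:
    if row not in seen:       # skip rows we have already emitted
        seen.add(row)
        deduped.append(row)

print(deduped)  # [(1, 'A'), (2, None)]
```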
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table at step 4: what is the null count for the 'value' column?
A. 0
B. 2
C. 1
D. 3
💡 Hint
Check the 'Null Counts' column at step 4 in the execution table.
At which step does the DataFrame reduce from 3 rows to 2 rows?
A. Step 2
B. Step 6
C. Step 5
D. Step 8
💡 Hint
Look at the 'DataFrame State' column, which shows row counts, in the execution table.
If the DataFrame had no duplicate rows, what would the duplicate count be at step 8?
A. 0
B. 1
C. 2
D. 3
💡 Hint
Duplicate count is total rows minus unique rows after dropDuplicates; see execution table step 8.
Concept Snapshot
Null and duplicate detection in Spark:
- Use isNull() with when() and count() to find nulls per column.
- Use dropDuplicates() to remove duplicate rows.
- Duplicate count = total rows - unique rows.
- Helps clean data before analysis.
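The per-column null-counting pattern in the snapshot can also be illustrated without a Spark session. As a hedged, Spark-free sketch mirroring `count(when(col(c).isNull(), 1))` on the sample data:

```python
# Spark-free sketch of per-column null counting, mirroring
# count(when(col(c).isNull(), 1)): count the None cells in each column.
rows = [(1, 'A'), (2, None), (1, 'A')]   # same sample data as above
columns = ['id', 'value']

null_counts = {
    c: sum(1 for row in rows if row[i] is None)
    for i, c in enumerate(columns)
}

print(null_counts)  # {'id': 0, 'value': 1}
```

Each column name maps to the number of None cells in that column, matching the null_counts dictionary the Spark version produces.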
Full Transcript
This visual execution shows how to detect null and duplicate values in a Spark DataFrame. We start by creating a DataFrame with some null and duplicate values. Then, we count nulls in each column using isNull and count functions. Next, we find duplicates by comparing the total row count with the count after removing duplicates using dropDuplicates. The difference gives the number of duplicate rows. This process helps identify data quality issues before analysis.