Apache Spark · ~10 mins

Null and duplicate detection in Apache Spark - Step-by-Step Execution

Concept Flow - Null and duplicate detection
Start with DataFrame
Check for Nulls
Count Nulls per Column
Check for Duplicates
Count Duplicate Rows
Output Null and Duplicate Summary
End
We start with a DataFrame, check each column for null values, count them, then check for duplicate rows and count those, finally outputting a summary.
Execution Sample
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, count, lit

spark = SparkSession.builder.getOrCreate()

# Sample DataFrame: one null in 'value'; rows 1 and 3 are exact duplicates.
df = spark.createDataFrame(
    [(1, 'A'), (2, None), (1, 'A')], ['id', 'value'])

# Per-column null counts: when() emits 1 for null cells (and null otherwise),
# and count() counts only the non-null results.
null_counts = df.select(
    [count(when(col(c).isNull(), lit(1))).alias(c) for c in df.columns]
).collect()[0].asDict()
# null_counts == {'id': 0, 'value': 1}

# Duplicate rows: total row count minus distinct row count.
dup_count = df.count() - df.dropDuplicates().count()
# dup_count == 1
This code creates a DataFrame, counts nulls per column by counting the rows where isNull() is true, and counts duplicate rows as the difference between the total and deduplicated row counts.
Execution Table
| Step | Action | DataFrame State | Null Counts | Duplicate Count |
|------|--------|-----------------|-------------|-----------------|
| 1 | Create DataFrame with 3 rows | [(1, 'A'), (2, None), (1, 'A')] | N/A | N/A |
| 2 | Count nulls in 'id' | Same | id: 0 | N/A |
| 3 | Count nulls in 'value' | Same | value: 1 | N/A |
| 4 | Count total nulls per column | Same | {'id': 0, 'value': 1} | N/A |
| 5 | Count total rows | 3 rows | Same | N/A |
| 6 | Drop duplicates | [(1, 'A'), (2, None)] | Same | N/A |
| 7 | Count rows after dropDuplicates | 2 rows | Same | N/A |
| 8 | Calculate duplicates: 3 - 2 | Same | Same | 1 |
| 9 | Output null counts and duplicate count | Same | {'id': 0, 'value': 1} | 1 |
| 10 | End of detection | Same | Same | Same |
💡 All rows processed; nulls and duplicates counted.
Variable Tracker
| Variable | Start | After Step 2 | After Step 4 | After Step 7 | After Step 8 | Final |
|----------|-------|--------------|--------------|--------------|--------------|-------|
| df | Empty | 3 rows with data | Same | 2 rows after dropDuplicates | Same | Same |
| null_counts | N/A | Partial counts | {'id': 0, 'value': 1} | Same | Same | {'id': 0, 'value': 1} |
| dup_count | N/A | N/A | N/A | N/A | 1 | 1 |
Key Moments - 3 Insights
Why does the null count for 'value' show 1 but 'id' shows 0?
Because only the 'value' column contains a None (null) value, while 'id' has no nulls, as shown in execution table rows 2-4.
How is the duplicate count calculated?
Duplicate count is total rows minus the rows remaining after dropping duplicates, as shown in execution table rows 5-8.
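The total-minus-distinct arithmetic does not depend on Spark itself; as a hedged illustration, here is a Spark-free sketch of the same calculation on plain Python tuples (using the sample rows from the example above):

```python
# Spark-free sketch of the duplicate-count logic: total rows minus distinct rows.
rows = [(1, 'A'), (2, None), (1, 'A')]  # same sample data as the Spark example

total = len(rows)
unique = len(set(rows))   # set() keeps one copy of each exact-duplicate row
dup_count = total - unique

print(dup_count)  # 1: the row (1, 'A') appears twice, so one duplicate
```

This mirrors `df.count() - df.dropDuplicates().count()`: `len(rows)` plays the role of the total count and `len(set(rows))` the distinct count.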
Does dropDuplicates remove all duplicates or just some?
dropDuplicates removes every exact duplicate row, keeping one copy of each distinct row, as seen in execution table row 6.
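To see which rows survive, here is a hedged, Spark-free sketch of the keep-one-copy behavior (dropDuplicates keeps a single copy of each distinct row; the first-occurrence ordering below is just for illustration, since Spark does not guarantee which copy is kept):

```python
# Spark-free sketch of dropDuplicates(): keep one copy of each distinct row.
rows = [(1, 'A'), (2, None), (1, 'A')]  # same sample data as above

seen = set()
deduped = []
for row in rows:
    if row not in seen:       # skip rows we have already emitted
        seen.add(row)
        deduped.append(row)

print(deduped)  # [(1, 'A'), (2, None)]
```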
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table at step 4: what is the null count for the 'value' column?
A. 0
B. 2
C. 1
D. 3
💡 Hint
Check the 'Null Counts' column at step 4 in the execution table.
At which step does the DataFrame reduce from 3 rows to 2 rows?
A. Step 2
B. Step 6
C. Step 5
D. Step 8
💡 Hint
Look at the 'DataFrame State' column, which shows row counts, in the execution table.
If the DataFrame had no duplicate rows, what would the duplicate count be at step 8?
A. 0
B. 1
C. 2
D. 3
💡 Hint
Duplicate count is total rows minus unique rows after dropDuplicates; see execution table step 8.
Concept Snapshot
Null and duplicate detection in Spark:
- Use isNull() with when() and count() to find nulls per column.
- Use dropDuplicates() to remove duplicate rows.
- Duplicate count = total rows - unique rows.
- Helps clean data before analysis.
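The per-column null-counting pattern in the snapshot can also be illustrated without a Spark session. As a hedged, Spark-free sketch mirroring `count(when(col(c).isNull(), 1))` on the sample data:

```python
# Spark-free sketch of per-column null counting, mirroring
# count(when(col(c).isNull(), 1)): count the None cells in each column.
rows = [(1, 'A'), (2, None), (1, 'A')]   # same sample data as above
columns = ['id', 'value']

null_counts = {
    c: sum(1 for row in rows if row[i] is None)
    for i, c in enumerate(columns)
}

print(null_counts)  # {'id': 0, 'value': 1}
```

Each column name maps to the number of None cells in that column, matching the null_counts dictionary the Spark version produces.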
Full Transcript
This visual execution shows how to detect null and duplicate values in a Spark DataFrame. We start by creating a DataFrame with some null and duplicate values. Then, we count nulls in each column using isNull and count functions. Next, we find duplicates by comparing the total row count with the count after removing duplicates using dropDuplicates. The difference gives the number of duplicate rows. This process helps identify data quality issues before analysis.