Concept Flow - Data quality assertions
Load DataFrame → Define Assertions → Apply Assertions on Data → Check Assertion Results → Continue
Load data, define checks (assertions), apply them, then handle pass or fail results.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load a small DataFrame; the second row has a null 'value'.
df = spark.createDataFrame([(1, 'A'), (2, None)], ['id', 'value'])

# Assertion: fail fast if any row has a null 'value'.
assert df.filter(df.value.isNull()).count() == 0, 'Null values found!'
```
| Step | Action | Evaluation | Result |
|---|---|---|---|
| 1 | Load DataFrame with 2 rows | DataFrame created | Rows: 2, Columns: id, value |
| 2 | Define assertion: no nulls in 'value' | Prepare filter condition | Condition ready |
| 3 | Filter rows where 'value' is null | Filter result count = 1 | One row has null |
| 4 | Check if count == 0 | 1 == 0 | False |
| 5 | Assertion fails | Raise error 'Null values found!' | Error raised, stop execution |

| Variable | Start | After Step 3 | After Step 4 | Final |
|---|---|---|---|---|
| df | Undefined | DataFrame with 2 rows | Same | Same |
| null_count | Undefined | 1 | 1 | 1 |
| assertion_result | Undefined | Undefined | False | Error raised |
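The variable trace above can be reproduced without a Spark cluster. This is a minimal sketch, assuming plain Python tuples stand in for the DataFrame rows; it follows the same steps and produces the same values as the tables:

```python
# Plain-Python stand-in for the PySpark example: rows mimic the
# DataFrame [(1, 'A'), (2, None)] with columns ['id', 'value'].
rows = [(1, "A"), (2, None)]

# Step 3: filter rows where 'value' (index 1) is null/None.
null_rows = [row for row in rows if row[1] is None]
null_count = len(null_rows)  # 1, matching the trace table

# Step 4: the assertion condition evaluates to False.
assertion_result = (null_count == 0)

# Step 5: a failing assert raises AssertionError; catching it lets
# the program report the problem instead of stopping outright.
try:
    assert assertion_result, "Null values found!"
except AssertionError as err:
    print(f"Assertion failed: {err}")
```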
Data quality assertions check whether data meets defined rules. Use filters to find bad records (e.g., nulls); the asserted condition must be True for the check to pass. If it is False, raise an error to stop the pipeline, or handle the failure explicitly. This catches data issues early, before they propagate downstream.
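One way to handle a failure rather than stop is to quarantine the failing rows and continue with the clean ones. A minimal sketch, again assuming plain Python tuples stand in for DataFrame rows; `split_by_quality` is a hypothetical helper, not a PySpark API:

```python
rows = [(1, "A"), (2, None), (3, "C")]

def split_by_quality(rows, check):
    """Partition rows into (good, bad) using a row-level check function."""
    good = [r for r in rows if check(r)]
    bad = [r for r in rows if not check(r)]
    return good, bad

# Rule: column 'value' (index 1) must not be null.
good, bad = split_by_quality(rows, lambda r: r[1] is not None)

if bad:
    # Handle instead of crashing: report/quarantine the bad rows,
    # then let downstream processing use only the rows that passed.
    print(f"Quarantined {len(bad)} bad row(s): {bad}")
```

In Spark the same partitioning would be two `df.filter(...)` calls with the condition and its negation; the control flow around them is identical.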