Data quality assertions in Apache Spark - Step-by-Step Execution

Concept Flow - Data quality assertions
Load DataFrame -> Define Assertions -> Apply Assertions on Data -> Check Assertion Results -> Continue
Load data, define checks (assertions), apply them, then handle pass or fail results.
Execution Sample
Python (PySpark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'A'), (2, None)], ['id', 'value'])
assert df.filter(df.value.isNull()).count() == 0, 'Null values found!'
This code counts the rows where the 'value' column is null and raises an AssertionError with the message 'Null values found!' if that count is not zero.
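One caveat with the sample: Python removes `assert` statements when the interpreter runs with the `-O` flag, so a check written this way can silently disappear. Below is a minimal sketch of the same check using an explicit exception instead. The names `DataQualityError` and `require_no_nulls` are hypothetical and not part of PySpark; `df` is assumed to be a PySpark DataFrame like the one created above.

```python
class DataQualityError(Exception):
    """Hypothetical exception type for failed data quality checks."""


def require_no_nulls(df, column):
    """Raise DataQualityError if `column` contains nulls; otherwise return df."""
    # Same filter-and-count logic as the sample, applied to a PySpark DataFrame.
    null_count = df.filter(df[column].isNull()).count()
    if null_count > 0:
        raise DataQualityError(
            f"Null values found! {null_count} row(s) in column '{column}'"
        )
    # Returning df lets the caller keep chaining transformations after the check.
    return df
```

Calling `require_no_nulls(df, 'value')` on the sample DataFrame raises `DataQualityError`, because one row has a null in 'value'.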
Execution Table
| Step | Action | Evaluation | Result |
|------|--------|------------|--------|
| 1 | Load DataFrame with 2 rows | DataFrame created | Rows: 2, Columns: id, value |
| 2 | Define assertion: no nulls in 'value' | Prepare filter condition | Condition ready |
| 3 | Filter rows where 'value' is null | Filter result count = 1 | One row has null |
| 4 | Check if count == 0 | 1 == 0 | False |
| 5 | Assertion fails | Raise error 'Null values found!' | Error raised, stop execution |
💡 The assertion failed because there is 1 null value in the 'value' column.
Variable Tracker
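| Variable | Start | After Step 3 | After Step 4 | Final |
|----------|-------|--------------|--------------|-------|
| df | Empty | DataFrame with 2 rows | Same | Same |
| null_count | Undefined | 1 | 1 | 1 |
| assertion_result | Undefined | Undefined | False | Error raised |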
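The one-line `assert` in the sample hides the intermediate values tracked above. As a sketch, steps 3 and 4 can be written as separate statements so that `null_count` and `assertion_result` exist as real variables. The function name `evaluate_null_check` is hypothetical; `df` is assumed to be a PySpark DataFrame.

```python
def evaluate_null_check(df, column="value"):
    """Return (null_count, assertion_result) for one column of a DataFrame."""
    # Step 3: keep only the rows where the column is null, then count them.
    null_count = df.filter(df[column].isNull()).count()
    # Step 4: evaluate the assertion condition on its own.
    assertion_result = null_count == 0
    return null_count, assertion_result
```

For the sample DataFrame, `evaluate_null_check(df)` returns `(1, False)`. Step 5 is then `assert assertion_result, 'Null values found!'`, which raises because the condition is False.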
Key Moments - 2 Insights
Why does the assertion fail even though the DataFrame has data?
Because the assertion requires zero nulls in the 'value' column, and step 3 finds 1 null row (see execution_table row 3). Having data is not enough; the data must also satisfy the rule.
What happens if the assertion condition is True?
If the condition is True (count == 0), the code continues without error (not shown here but implied after step 4).
Visual Quiz - 3 Questions
Test your understanding
According to the execution table, what is the count of null values in the 'value' column at step 3?
A. 0
B. 2
C. 1
D. Undefined
💡 Hint
Check the 'Evaluation' column at step 3 in the execution_table.
At which step does the assertion fail and raise an error?
A. Step 2
B. Step 5
C. Step 3
D. Step 4
💡 Hint
Look for the row mentioning 'Assertion fails' and 'Error raised' in execution_table.
If the DataFrame had no nulls in 'value', how would the assertion result change at step 4?
A. It would be True
B. It would be False
C. It would raise an error
D. It would be undefined
💡 Hint
Step 4 compares null_count == 0; if no nulls, count is 0, so condition is True.
Concept Snapshot
Data quality assertions check if data meets rules.
Use filters to find bad data (e.g., nulls).
Assert conditions must be True to pass.
If False, raise error to stop or handle.
Helps catch data issues early.
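The "stop or handle" choice above can be sketched as a small runner that evaluates every rule and reports all failures, rather than stopping at the first one. This is an illustrative design, not a PySpark feature: `run_checks`, `sample_checks`, and the rule names are hypothetical, and the `id_positive` rule is an assumed example that does not appear in the original sample.

```python
def run_checks(df, checks):
    """Run every check and return {rule_name: violation_count} for failures.

    `checks` maps a rule name to a function that takes the DataFrame and
    returns how many rows violate the rule (0 means the rule passes).
    """
    failures = {}
    for name, count_violations in checks.items():
        violations = count_violations(df)
        if violations != 0:
            failures[name] = violations
    return failures


# Hypothetical rules for the sample DataFrame (columns: id, value).
sample_checks = {
    # Rows where 'value' is null violate the rule.
    "value_not_null": lambda df: df.filter(df.value.isNull()).count(),
    # Rows where 'id' is zero or negative violate the rule.
    "id_positive": lambda df: df.filter(df.id <= 0).count(),
}
```

With the sample DataFrame, `run_checks(df, sample_checks)` returns `{'value_not_null': 1}`. The caller can then decide whether to raise an error, log the failures, or quarantine the bad rows.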
Full Transcript
This visual execution shows how to use data quality assertions in Apache Spark. First, a DataFrame is loaded with sample data. Then, an assertion is defined to check that the 'value' column has no nulls. The code filters the DataFrame to count nulls and compares the count to zero. Since there is one null, the assertion fails and raises an error. Variables like the DataFrame and null count are tracked step-by-step. Key moments clarify why the assertion fails and what happens if it passes. The quiz tests understanding of the null count, failure step, and condition logic. This helps beginners see how assertions catch data problems early in data science workflows.