Concept Flow - Data quality assertions
Load DataFrame → Define Assertions → Apply Assertions on Data → Check Assertion Results → Continue
Load data, define checks (assertions), apply them, then handle pass or fail results.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load a small DataFrame; the second row has a null 'value'.
df = spark.createDataFrame([(1, 'A'), (2, None)], ['id', 'value'])

# Assertion: fail fast if any row has a null 'value'.
assert df.filter(df.value.isNull()).count() == 0, 'Null values found!'
```
| Step | Action | Evaluation | Result |
|---|---|---|---|
| 1 | Load DataFrame with 2 rows | DataFrame created | Rows: 2, Columns: id, value |
| 2 | Define assertion: no nulls in 'value' | Prepare filter condition | Condition ready |
| 3 | Filter rows where 'value' is null | Filter result count = 1 | One row has null |
| 4 | Check if count == 0 | 1 == 0 | False |
| 5 | Assertion fails | Raise error 'Null values found!' | Error raised, stop execution |

| Variable | Start | After Step 3 | After Step 4 | Final |
|---|---|---|---|---|
| df | Undefined | DataFrame with 2 rows | Same | Same |
| null_count | Undefined | 1 | 1 | 1 |
| assertion_result | Undefined | Undefined | False | Error raised |
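The variable trace above can be reproduced without a Spark cluster. This is a minimal sketch, assuming plain Python tuples stand in for the DataFrame rows; it follows the same steps and produces the same values as the tables:

```python
# Plain-Python stand-in for the PySpark example: rows mimic the
# DataFrame [(1, 'A'), (2, None)] with columns ['id', 'value'].
rows = [(1, "A"), (2, None)]

# Step 3: filter rows where 'value' (index 1) is null/None.
null_rows = [row for row in rows if row[1] is None]
null_count = len(null_rows)  # 1, matching the trace table

# Step 4: the assertion condition evaluates to False.
assertion_result = (null_count == 0)

# Step 5: a failing assert raises AssertionError; catching it lets
# the program report the problem instead of stopping outright.
try:
    assert assertion_result, "Null values found!"
except AssertionError as err:
    print(f"Assertion failed: {err}")
```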
Data quality assertions check whether data meets defined rules. Use filters to find bad records (e.g., nulls); the asserted condition must be True for the check to pass. If it is False, raise an error to stop the pipeline, or handle the failure explicitly. This catches data issues early, before they propagate downstream.
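One way to handle a failure rather than stop is to quarantine the failing rows and continue with the clean ones. A minimal sketch, again assuming plain Python tuples stand in for DataFrame rows; `split_by_quality` is a hypothetical helper, not a PySpark API:

```python
rows = [(1, "A"), (2, None), (3, "C")]

def split_by_quality(rows, check):
    """Partition rows into (good, bad) using a row-level check function."""
    good = [r for r in rows if check(r)]
    bad = [r for r in rows if not check(r)]
    return good, bad

# Rule: column 'value' (index 1) must not be null.
good, bad = split_by_quality(rows, lambda r: r[1] is not None)

if bad:
    # Handle instead of crashing: report/quarantine the bad rows,
    # then let downstream processing use only the rows that passed.
    print(f"Quarantined {len(bad)} bad row(s): {bad}")
```

In Spark the same partitioning would be two `df.filter(...)` calls with the condition and its negation; the control flow around them is identical.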