What are data quality assertions in Apache Spark?


Data quality assertions in Apache Spark


Introduction

Data quality assertions check that your data is correct and reliable before you use it. They catch mistakes early, while they are still easy to fix. Typical places to use them:

When loading new data to make sure it meets expected rules.

Before running reports to ensure data is complete and accurate.

When cleaning data to verify no invalid values remain.

During data pipelines to stop processing if data is bad.

Before training machine learning models to ensure good input.

Syntax

Apache Spark

from pyspark.sql.functions import col

dataframe.filter(condition).count() == expected_count

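Use a Spark DataFrame filter to select the rows that match a condition, then compare the row count with the value you expect.

Most assertions count rows that break a rule, such as nulls or out-of-range values, and require that count to be zero or below a limit.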

Examples

Check that no ages are negative.

Apache Spark

df.filter(col('age') < 0).count() == 0

Assert that there are no missing emails.

Apache Spark

df.filter(col('email').isNull()).count() == 0

Allow up to 9 salaries below 30000, but no more.

Apache Spark

df.filter(col('salary') < 30000).count() < 10

Sample Program

This code creates a small dataset and checks three data quality rules using assertions. It prints whether each rule passes or fails.

Apache Spark

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('DataQualityAssertions').getOrCreate()

# Sample data
data = [
    (1, 'Alice', 25, 'alice@example.com', 50000),
    (2, 'Bob', 30, None, 60000),
    (3, 'Charlie', -5, 'charlie@example.com', 70000),
    (4, 'David', 40, 'david@example.com', 25000)
]

columns = ['id', 'name', 'age', 'email', 'salary']

df = spark.createDataFrame(data, columns)

# Assertion 1: No negative ages
no_negative_ages = df.filter(col('age') < 0).count() == 0

# Assertion 2: No missing emails
no_missing_emails = df.filter(col('email').isNull()).count() == 0

# Assertion 3: At most 1 salary below 30000
few_low_salaries = df.filter(col('salary') < 30000).count() <= 1

print(f'No negative ages: {no_negative_ages}')
print(f'No missing emails: {no_missing_emails}')
print(f'At most 1 low salary: {few_low_salaries}')

spark.stop()

Output

No negative ages: False
No missing emails: False
At most 1 low salary: True

Important Notes

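Each check evaluates to True if the data meets the rule and False if not. On its own it does not stop the program; wrap it in a Python assert or raise an exception if a failure should halt the job.

Every count() call is a Spark action that scans the data, so each separate check runs its own job.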

Use assertions early to catch bad data before analysis.

You can combine multiple assertions for thorough checks.

Summary

Data quality assertions check if data follows expected rules.

They help find errors like missing or invalid values.

Use Spark filters and counts to write simple assertions.