Data quality assertions in Apache Spark - Time & Space Complexity
We want to understand how the time needed to check data quality changes as the data grows.
How does the number of checks grow when we have more data rows?
Analyze the time complexity of the following code snippet.
```python
from pyspark.sql.functions import col

def assert_no_nulls(df, column_name):
    # Keep only rows where the column is null, then count them.
    # count() is an action, so it triggers a full scan of the DataFrame.
    null_count = df.filter(col(column_name).isNull()).count()
    assert null_count == 0, f"Null values found in {column_name}"

# Example usage
assert_no_nulls(dataframe, "age")
```
This code checks if a specific column in a Spark DataFrame has any null values by filtering and counting them.
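To see the same logic without a Spark cluster, here is a plain-Python analogue (an illustrative sketch; the name `assert_no_nulls_local` and the sample data are hypothetical, not part of any library). Spark distributes the scan across executors, but the per-row work is the same idea.

```python
# Plain-Python sketch of the Spark check above (illustration only).
def assert_no_nulls_local(rows, column_name):
    # One pass over the data: count rows whose value is missing.
    null_count = sum(1 for row in rows if row.get(column_name) is None)
    assert null_count == 0, f"Null values found in {column_name}"

# Example usage (hypothetical data)
people = [{"age": 34}, {"age": 51}]
assert_no_nulls_local(people, "age")  # passes: no nulls present
```

The filter-then-count shape is identical; only the execution engine differs.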
Identify the loops, recursion, or array traversals — anything that repeats.
- Primary operation: Filtering rows to find null values in one column.
- How many times: once per row — the filter scans all n rows exactly once.
The time to check grows roughly in direct proportion to the number of rows because each row is checked once.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 checks |
| 100 | 100 checks |
| 1000 | 1000 checks |
Pattern observation: Doubling the data roughly doubles the work needed to check for nulls.
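The pattern in the table can be made visible by instrumenting the scan. Below is a small counter sketch (plain Python, not Spark; the helper `count_checks` is a made-up name for illustration) that tallies how many rows are inspected.

```python
# Instrumented linear scan: tally one "check" per row inspected.
def count_checks(rows, column_name):
    checks = 0
    nulls = 0
    for row in rows:
        checks += 1  # one check per row, so checks == len(rows)
        if row.get(column_name) is None:
            nulls += 1
    return checks, nulls

# For n = 10, 100, 1000 the number of checks equals n:
for n in (10, 100, 1000):
    data = [{"age": i} for i in range(n)]
    checks, _ = count_checks(data, "age")
    print(n, checks)  # doubling n doubles the work
```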
Time Complexity: O(n)
This means the time to check data quality grows linearly with the number of rows.
[X] Wrong: "Checking for nulls is instant no matter how big the data is."
[OK] Correct: The system must look at each row to find nulls, so more rows mean more work.
Understanding how data quality checks scale helps you design efficient data pipelines and shows you know how to handle growing data.
"What if we check multiple columns for nulls in one pass? How would the time complexity change?"
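One hedged answer sketch, again in plain Python so it runs anywhere (the function `null_counts` is hypothetical): checking k columns in a single pass is still O(n) in the row count, but the constant factor grows with k, so the total work is O(n × k). In Spark itself, a common idiom for this is a single `df.select(...)` combining `count` and `when(col(c).isNull(), ...)` per column, which likewise makes one pass over the data.

```python
# One-pass null counting for several columns (illustrative sketch).
# Still O(n) in rows; total work is O(n * k) for k columns.
def null_counts(rows, column_names):
    counts = {name: 0 for name in column_names}
    for row in rows:                # one pass over all rows
        for name in column_names:   # k column checks per row
            if row.get(name) is None:
                counts[name] += 1
    return counts

# Example usage (hypothetical data)
rows = [{"age": 34, "name": "Ada"}, {"age": None, "name": None}]
print(null_counts(rows, ["age", "name"]))  # {'age': 1, 'name': 1}
```

Compared with calling the single-column check once per column (k separate passes, also O(n × k) but with k full scans), the one-pass version reads each row only once, which matters when scanning the data is the expensive part.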