dbt-expectations for data quality - Time & Space Complexity
When using dbt-expectations to check data quality, it is important to understand how the cost of running these tests scales as the data grows.
Analyze the time complexity of the following dbt-expectations test.
```yaml
models:
  - name: users
    columns:
      - name: user_id
        tests:
          - dbt_expectations.expect_column_values_to_not_be_null:
              config:
                severity: error
```
This test checks that the column user_id in the users table has no missing values.
Look at what the test does repeatedly.
- Primary operation: scanning each row in the `user_id` column to check for nulls.
- How many times: once for every row in the `users` table.
As the number of rows grows, the test must check more values.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 checks for null values |
| 100 | 100 checks for null values |
| 1000 | 1000 checks for null values |
Pattern observation: The number of checks grows directly with the number of rows.
Time Complexity: O(n)
This means the time to run the test grows linearly with the data size: doubling the number of rows roughly doubles the runtime.
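The linear pattern above can be sketched in plain Python. This is a hypothetical in-memory analogue of what the database scan does, not how dbt actually executes the test (dbt compiles it to SQL that the warehouse runs):

```python
def count_null_checks(rows):
    """Simulate the not-null test: examine every row exactly once."""
    checks = 0
    nulls = 0
    for value in rows:
        checks += 1          # one comparison per row -> O(n) total
        if value is None:
            nulls += 1
    return checks, nulls

# The number of checks equals the number of rows, matching the table above.
for n in (10, 100, 1000):
    data = list(range(n))    # hypothetical user_id column with n rows
    checks, nulls = count_null_checks(data)
    print(n, checks, nulls)  # checks == n
```

Running this reproduces the pattern in the table: 10 rows means 10 checks, 1000 rows means 1000 checks.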
[X] Wrong: "The test runs instantly no matter how big the data is."
[OK] Correct: The test must look at every row to be sure there are no nulls, so it takes longer with more data.
Understanding how data quality checks scale helps you write efficient tests and explain their impact clearly in real projects.
"What if we added a test that checks uniqueness of a column instead of nulls? How would the time complexity change?"