Why sources define raw data contracts in dbt - Performance Analysis
We want to understand how defining raw data contracts in sources affects the time it takes to process data in dbt.
Specifically, how does checking these contracts scale as data grows?
Analyze the time complexity of this dbt source definition with raw data contracts.
sources:
  - name: raw_data
    tables:
      - name: users
        columns:
          - name: id
            tests:
              - not_null
              - unique
          - name: email
            tests:
              - not_null
              - unique
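Conceptually, each of these tests scans every row of the table. dbt actually compiles each test to a SQL query run by your warehouse, but a minimal Python sketch (hypothetical data and function names) captures the per-row work:

```python
# Sketch of the per-row work behind dbt's not_null and unique tests.
# In reality dbt compiles each test to SQL; this just models the scans.

def check_not_null(rows, column):
    """Scan every row once; collect rows where the column is NULL (None)."""
    return [r for r in rows if r[column] is None]

def check_unique(rows, column):
    """Scan every row once, tracking seen values in a set to find duplicates."""
    seen, duplicates = set(), []
    for r in rows:
        value = r[column]
        if value in seen:
            duplicates.append(value)
        seen.add(value)
    return duplicates

users = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@example.com"},
]

print(check_not_null(users, "email"))  # one row violates not_null
print(check_unique(users, "email"))    # one duplicate email value
```

Each helper touches every row exactly once, which is why the cost of a single test grows with the row count.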
This YAML defines a `raw_data` source whose contract requires that the `id` and `email` columns of the `users` table are not null and unique.
Look at what repeats when dbt runs these tests.
- Primary operation: scanning each row of the source table to check a constraint.
- How many times: once per test per column; here that is 2 tests x 2 columns = 4 full scans of the table.
As the number of rows grows, the checks take longer because each row is examined.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | ~40 (2 tests x 2 columns x 10 rows) |
| 100 | ~400 (2 tests x 2 columns x 100 rows) |
| 1000 | ~4000 (2 tests x 2 columns x 1000 rows) |
Pattern observation: Operations grow roughly in direct proportion to the number of rows.
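The table above can be reproduced with a one-line formula: total checks = tests per column x columns x rows. A small sketch (the function name is illustrative, not part of dbt):

```python
# Total constraint checks for this source: tests_per_column x columns x rows.
# With 2 tests on each of 2 columns, the work is 4n, i.e. linear in n.

def total_operations(rows, columns=2, tests_per_column=2):
    return tests_per_column * columns * rows

for n in (10, 100, 1000):
    print(n, total_operations(n))
# 10 40
# 100 400
# 1000 4000
```

Doubling the rows doubles the operations; the multiplier (4 here) comes from the number of tested columns and tests, which stay fixed as data grows.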
Time Complexity: O(n)
With 2 tests on each of 2 columns the total work is about 4n operations, but constant factors drop out of big-O notation, so the time to check raw data contracts grows linearly with the number of rows in the source.
[X] Wrong: "Adding more columns with tests does not affect time much because tests run independently."
[OK] Correct: Each test on each column scans all rows, so more columns or tests multiply the work.
Understanding how data validation scales helps you design efficient data pipelines and write better dbt models.
What if we added a test that compares values between two columns? How would the time complexity change?