Why sources define raw data contracts in dbt - Performance Analysis
Start learning this pattern below
Jump into concepts and practice - no test required
We want to understand how defining raw data contracts in sources affects the time it takes to process data in dbt.
Specifically, how does checking these contracts scale as data grows?
Analyze the time complexity of this dbt source definition with raw data contracts.
sources:
- name: raw_data
tables:
- name: users
columns:
- name: id
tests:
- not_null
- unique
- name: email
tests:
- not_null
- unique
This code defines a source with raw data contracts that check if columns are not null and unique.
Look at what repeats when dbt runs these tests.
- Primary operation: Scanning each row in the source table to check constraints.
- How many times: Once per test per column, over all rows.
As the number of rows grows, the checks take longer because each row is examined.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | ~40 (2 tests x 2 columns x 10 rows) |
| 100 | ~400 (2 tests x 2 columns x 100 rows) |
| 1000 | ~4000 (2 tests x 2 columns x 1000 rows) |
Pattern observation: Operations grow roughly in direct proportion to the number of rows.
Time Complexity: O(n)
This means the time to check raw data contracts grows linearly with the number of rows in the source.
[X] Wrong: "Adding more columns with tests does not affect time much because tests run independently."
[OK] Correct: Each test on each column scans all rows, so more columns or tests multiply the work.
Understanding how data validation scales helps you design efficient data pipelines and write better dbt models.
What if we added a test that compares values between two columns? How would the time complexity change?
Practice
Solution
Step 1: Understand the purpose of raw data contracts
Raw data contracts define what the incoming data should look like, such as expected columns and types.Step 2: Identify the main benefit in dbt context
They help teams know what to expect and catch issues early, not speed up loading or fix errors automatically.Final Answer:
To set clear expectations for the raw data coming into the system -> Option AQuick Check:
Raw data contracts = clear data expectations [OK]
- Thinking contracts speed up data loading
- Assuming contracts fix data automatically
- Confusing contracts with reporting tools
Solution
Step 1: Recall dbt source YAML structure
Sources are defined undersources:as a list withnameandtableskeys.Step 2: Match correct indentation and keys
sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique] correctly usessourceslist,name,tables, andcolumnswith tests.Final Answer:
sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique] -> Option BQuick Check:
dbt source YAML = list with name, tables, columns [OK]
- Using singular 'source' instead of 'sources'
- Incorrect indentation or missing keys
- Listing columns without proper nesting
sources:
- name: raw_sales
tables:
- name: transactions
columns:
- name: transaction_id
tests: [not_null, unique]
- name: amount
tests: [not_null]
What happens if a transaction has a null amount when running dbt tests?Solution
Step 1: Understand the 'not_null' test in dbt
The 'not_null' test checks that no null values exist in the specified column.Step 2: Predict test behavior on null data
If a null value exists in 'amount', the 'not_null' test will fail and alert the user.Final Answer:
The test will fail, alerting that a null value exists in 'amount' -> Option DQuick Check:
'not_null' test fails on nulls [OK]
- Thinking dbt fills nulls automatically
- Assuming tests pass by default
- Believing tests skip columns with nulls
sources:
- name: raw_data
tables:
- name: customers
columns:
- name: customer_id
tests: not_null, unique
When running dbt, you get a syntax error. What is the problem?Solution
Step 1: Check YAML syntax for tests
Tests must be listed as a YAML list inside square brackets or as a list with dashes.Step 2: Identify the error in tests format
Writingtests: not_null, uniqueis invalid YAML; it should betests: [not_null, unique].Final Answer:
The tests list should be inside square brackets [ ] -> Option AQuick Check:
Tests need brackets or dashes in YAML [OK]
- Writing tests as comma-separated string without brackets
- Using wrong key names like 'column' instead of 'columns'
- Misunderstanding list vs dictionary in YAML
Solution
Step 1: Identify required tests for 'order_id'
To ensure uniqueness and no nulls, use tests [not_null, unique] on 'order_id'.Step 2: Define tests for 'order_date'
To ensure presence and valid dates, use [not_null] and a test like 'accepted_values' or a custom test for date format.Step 3: Combine tests in source YAML
Include both columns with their respective tests to catch issues early at the source level.Final Answer:
Define the source with columns 'order_id' and 'order_date' and add tests: 'order_id' with [not_null, unique], 'order_date' with [not_null, accepted_values] for dates -> Option CQuick Check:
Raw data contracts include all critical tests [OK]
- Skipping tests on important columns
- Relying on downstream models for raw data validation
- Not testing data formats like dates
