What if you could stop chasing data errors and start trusting your data every day?
Why sources define raw data contracts in dbt - The Real Reasons
Start learning this pattern below
Jump into concepts and practice - no test required
Imagine you receive data files from different teams every day. Each file has different formats, missing columns, or unexpected changes. You try to process them manually by checking each file before using it.
This manual checking is slow and tiring. You often miss changes or errors, causing your reports to be wrong. Fixing these mistakes later wastes time and causes frustration.
Defining raw data contracts means setting clear rules about what the data should look like before you use it. This helps catch problems early and keeps your data clean and reliable automatically.
if 'date' in data.columns and data['date'].notnull().all(): process(data)
sources('raw_data').expect_columns(['date', 'id', 'value']).expect_not_null('date')
It lets you trust your data pipeline and focus on analysis, not fixing data problems.
A marketing team defines a contract for customer data to ensure every record has an ID and signup date. This prevents errors in campaign reports caused by missing or wrong data.
Manual data checks are slow and error-prone.
Raw data contracts set clear expectations for incoming data.
They help catch issues early and keep data reliable.
Practice
Solution
Step 1: Understand the purpose of raw data contracts
Raw data contracts define what the incoming data should look like, such as expected columns and types.Step 2: Identify the main benefit in dbt context
They help teams know what to expect and catch issues early, not speed up loading or fix errors automatically.Final Answer:
To set clear expectations for the raw data coming into the system -> Option AQuick Check:
Raw data contracts = clear data expectations [OK]
- Thinking contracts speed up data loading
- Assuming contracts fix data automatically
- Confusing contracts with reporting tools
Solution
Step 1: Recall dbt source YAML structure
Sources are defined undersources:as a list withnameandtableskeys.Step 2: Match correct indentation and keys
sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique] correctly usessourceslist,name,tables, andcolumnswith tests.Final Answer:
sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique] -> Option BQuick Check:
dbt source YAML = list with name, tables, columns [OK]
- Using singular 'source' instead of 'sources'
- Incorrect indentation or missing keys
- Listing columns without proper nesting
sources:
- name: raw_sales
tables:
- name: transactions
columns:
- name: transaction_id
tests: [not_null, unique]
- name: amount
tests: [not_null]
What happens if a transaction has a null amount when running dbt tests?Solution
Step 1: Understand the 'not_null' test in dbt
The 'not_null' test checks that no null values exist in the specified column.Step 2: Predict test behavior on null data
If a null value exists in 'amount', the 'not_null' test will fail and alert the user.Final Answer:
The test will fail, alerting that a null value exists in 'amount' -> Option DQuick Check:
'not_null' test fails on nulls [OK]
- Thinking dbt fills nulls automatically
- Assuming tests pass by default
- Believing tests skip columns with nulls
sources:
- name: raw_data
tables:
- name: customers
columns:
- name: customer_id
tests: not_null, unique
When running dbt, you get a syntax error. What is the problem?Solution
Step 1: Check YAML syntax for tests
Tests must be listed as a YAML list inside square brackets or as a list with dashes.Step 2: Identify the error in tests format
Writingtests: not_null, uniqueis invalid YAML; it should betests: [not_null, unique].Final Answer:
The tests list should be inside square brackets [ ] -> Option AQuick Check:
Tests need brackets or dashes in YAML [OK]
- Writing tests as comma-separated string without brackets
- Using wrong key names like 'column' instead of 'columns'
- Misunderstanding list vs dictionary in YAML
Solution
Step 1: Identify required tests for 'order_id'
To ensure uniqueness and no nulls, use tests [not_null, unique] on 'order_id'.Step 2: Define tests for 'order_date'
To ensure presence and valid dates, use [not_null] and a test like 'accepted_values' or a custom test for date format.Step 3: Combine tests in source YAML
Include both columns with their respective tests to catch issues early at the source level.Final Answer:
Define the source with columns 'order_id' and 'order_date' and add tests: 'order_id' with [not_null, unique], 'order_date' with [not_null, accepted_values] for dates -> Option CQuick Check:
Raw data contracts include all critical tests [OK]
- Skipping tests on important columns
- Relying on downstream models for raw data validation
- Not testing data formats like dates
