Jump into concepts and practice - no test required
or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Understanding Why Sources Define Raw Data Contracts in dbt
📖 Scenario: Imagine you work in a company where multiple teams provide data to a central data warehouse. Each team sends raw data files with different formats and quality. To keep the data clean and reliable, your team uses dbt to manage data transformations and ensure everyone agrees on the data format.
🎯 Goal: You will create a simple example to understand why defining sources in dbt acts as a raw data contract. This contract helps your team know what raw data to expect and how to check it before using it in reports.
📋 What You'll Learn
Create a dictionary called raw_data_sources with exact source names and their expected columns
Create a variable called required_columns listing columns that must be present
Write a loop using for source, columns in raw_data_sources.items() to check if required columns exist
Print the results showing which sources meet the raw data contract
💡 Why This Matters
🌍 Real World
In real companies, raw data comes from many places. Defining sources as contracts helps data teams trust and use data safely.
💼 Career
Data engineers and analysts use raw data contracts in dbt to ensure data quality and avoid errors in reports and dashboards.
Progress0 / 4 steps
1
Create the raw data sources dictionary
Create a dictionary called raw_data_sources with these exact entries: 'sales_data': ['order_id', 'customer_id', 'amount', 'date'], 'inventory_data': ['product_id', 'stock', 'warehouse'], and 'customer_data': ['customer_id', 'name', 'email'].
dbt
Hint
Use a dictionary with keys as source names and values as lists of column names.
2
Define the required columns list
Create a list called required_columns with these exact values: 'customer_id' and 'date'.
dbt
Hint
Use a list with the exact column names required.
3
Check each source for required columns
Use a for loop with variables source and columns to iterate over raw_data_sources.items(). Inside the loop, create a variable has_all_required that is True if all required_columns are in columns, otherwise False. Store the results in a dictionary called contract_check with source names as keys and has_all_required as values.
dbt
Hint
Use all() to check if every required column is in the source columns.
4
Print the contract check results
Write a print(contract_check) statement to display which sources meet the raw data contract.
dbt
Hint
Use print(contract_check) to show the results.
Practice
(1/5)
1. Why do we define raw data contracts in dbt sources?
easy
A. To set clear expectations for the raw data coming into the system
B. To speed up the data loading process
C. To automatically fix data errors
D. To create visual reports from raw data
Solution
Step 1: Understand the purpose of raw data contracts
Raw data contracts define what the incoming data should look like, such as expected columns and types.
Step 2: Identify the main benefit in dbt context
They help teams know what to expect and catch issues early, not speed up loading or fix errors automatically.
Final Answer:
To set clear expectations for the raw data coming into the system -> Option A
Quick Check:
Raw data contracts = clear data expectations [OK]
Hint: Raw data contracts = clear data rules for sources [OK]
Common Mistakes:
Thinking contracts speed up data loading
Assuming contracts fix data automatically
Confusing contracts with reporting tools
2. Which of the following is the correct way to define a source in a dbt YAML file for raw data contracts?
easy
A. source:
name: raw_data
table: users
columns:
- id
tests: [not_null, unique]
B. sources:
- name: raw_data
tables:
- name: users
columns:
- name: id
tests: [not_null, unique]
C. sources:
raw_data:
users:
columns:
- id
tests: [not_null, unique]
D. source:
- raw_data
- users
- columns:
- id
tests: [not_null, unique]
Solution
Step 1: Recall dbt source YAML structure
Sources are defined under sources: as a list with name and tables keys.
Step 2: Match correct indentation and keys
sources:
- name: raw_data
tables:
- name: users
columns:
- name: id
tests: [not_null, unique] correctly uses sources list, name, tables, and columns with tests.
Final Answer:
sources:
- name: raw_data
tables:
- name: users
columns:
- name: id
tests: [not_null, unique] -> Option B
Quick Check:
dbt source YAML = list with name, tables, columns [OK]
Hint: Sources use list with name and tables keys in YAML [OK]
When running dbt, you get a syntax error. What is the problem?
medium
A. The tests list should be inside square brackets [ ]
B. The 'columns' key should be 'column'
C. The 'tables' key should be a dictionary, not a list
D. The source name cannot be 'raw_data'
Solution
Step 1: Check YAML syntax for tests
Tests must be listed as a YAML list inside square brackets or as a list with dashes.
Step 2: Identify the error in tests format
Writing tests: not_null, unique is invalid YAML; it should be tests: [not_null, unique].
Final Answer:
The tests list should be inside square brackets [ ] -> Option A
Quick Check:
Tests need brackets or dashes in YAML [OK]
Hint: Tests must be a list with brackets or dashes [OK]
Common Mistakes:
Writing tests as comma-separated string without brackets
Using wrong key names like 'column' instead of 'columns'
Misunderstanding list vs dictionary in YAML
5. You want to ensure your raw data source in dbt matches a strict contract: every 'order_id' must be unique and not null, and 'order_date' must be present and in date format. How should you define this in your source YAML to catch issues early?
hard
A. Define the source with columns and add tests only for 'order_id' as unique, ignoring 'order_date'
B. Define the source with columns but no tests; rely on downstream models to catch errors
C. Define the source with columns 'order_id' and 'order_date' and add tests: 'order_id' with [not_null, unique], 'order_date' with [not_null, accepted_values] for dates
D. Define the source with columns and add tests for 'order_date' only, ignoring 'order_id'
Solution
Step 1: Identify required tests for 'order_id'
To ensure uniqueness and no nulls, use tests [not_null, unique] on 'order_id'.
Step 2: Define tests for 'order_date'
To ensure presence and valid dates, use [not_null] and a test like 'accepted_values' or a custom test for date format.
Step 3: Combine tests in source YAML
Include both columns with their respective tests to catch issues early at the source level.
Final Answer:
Define the source with columns 'order_id' and 'order_date' and add tests: 'order_id' with [not_null, unique], 'order_date' with [not_null, accepted_values] for dates -> Option C
Quick Check:
Raw data contracts include all critical tests [OK]
Hint: Test all critical columns with not_null and uniqueness [OK]
Common Mistakes:
Skipping tests on important columns
Relying on downstream models for raw data validation