0
0
dbtdata~30 mins

Why sources define raw data contracts in dbt - See It in Action

Choose your learning style9 modes available
Understanding Why Sources Define Raw Data Contracts in dbt
📖 Scenario: Imagine you work in a company where multiple teams provide data to a central data warehouse. Each team sends raw data files with different formats and quality. To keep the data clean and reliable, your team uses dbt to manage data transformations and ensure everyone agrees on the data format.
🎯 Goal: You will create a simple example to understand why defining sources in dbt acts as a raw data contract. This contract helps your team know what raw data to expect and how to check it before using it in reports.
📋 What You'll Learn
Create a dictionary called raw_data_sources with exact source names and their expected columns
Create a variable called required_columns listing columns that must be present
Write a loop using for source, columns in raw_data_sources.items() to check if required columns exist
Print the results showing which sources meet the raw data contract
💡 Why This Matters
🌍 Real World
In real companies, raw data comes from many places. Defining sources as contracts helps data teams trust and use data safely.
💼 Career
Data engineers and analysts use raw data contracts in dbt to ensure data quality and avoid errors in reports and dashboards.
Progress0 / 4 steps
1
Create the raw data sources dictionary
Create a dictionary called raw_data_sources with these exact entries: 'sales_data': ['order_id', 'customer_id', 'amount', 'date'], 'inventory_data': ['product_id', 'stock', 'warehouse'], and 'customer_data': ['customer_id', 'name', 'email'].
dbt
Need a hint?

Use a dictionary with keys as source names and values as lists of column names.

2
Define the required columns list
Create a list called required_columns with these exact values: 'customer_id' and 'date'.
dbt
Need a hint?

Use a list with the exact column names required.

3
Check each source for required columns
Use a for loop with variables source and columns to iterate over raw_data_sources.items(). Inside the loop, create a variable has_all_required that is True if all required_columns are in columns, otherwise False. Store the results in a dictionary called contract_check with source names as keys and has_all_required as values.
dbt
Need a hint?

Use all() to check if every required column is in the source columns.

4
Print the contract check results
Write a print(contract_check) statement to display which sources meet the raw data contract.
dbt
Need a hint?

Use print(contract_check) to show the results.