dbt · data · ~30 mins

Why sources define raw data contracts in dbt - See It in Action

Understanding Why Sources Define Raw Data Contracts in dbt

📖 Scenario: Imagine you work in a company where multiple teams provide data to a central data warehouse. Each team sends raw data files with different formats and quality. To keep the data clean and reliable, your team uses dbt to manage data transformations and ensure everyone agrees on the data format.

🎯 Goal: You will build a small example to see why defining sources in dbt acts as a raw data contract. The contract tells your team which raw data to expect and how to check it before it is used in reports. The exercise models the idea in plain Python; in a real dbt project, sources are declared in YAML files.

📋 What You'll Learn

Create a dictionary called raw_data_sources with exact source names and their expected columns

Create a variable called required_columns listing columns that must be present

Write a loop using for source, columns in raw_data_sources.items() to check if required columns exist

Print the results showing which sources meet the raw data contract

💡 Why This Matters

🌍 Real World

In real companies, raw data arrives from many systems and teams. Declaring each one as a source with its expected columns gives data teams a single agreed definition to check incoming data against.

💼 Career

Data engineers and analysts use raw data contracts in dbt to ensure data quality and avoid errors in reports and dashboards.

Create the raw data sources dictionary

Create a dictionary called raw_data_sources with these exact entries: 'sales_data': ['order_id', 'customer_id', 'amount', 'date'], 'inventory_data': ['product_id', 'stock', 'warehouse'], and 'customer_data': ['customer_id', 'name', 'email'].

Python

# Create the raw_data_sources dictionary with exact keys and lists
# Your code here

Hint

Use a dictionary with keys as source names and values as lists of column names.

Define the required columns list

Create a list called required_columns with these exact values: 'customer_id' and 'date'.

Python

raw_data_sources = {
    'sales_data': ['order_id', 'customer_id', 'amount', 'date'],
    'inventory_data': ['product_id', 'stock', 'warehouse'],
    'customer_data': ['customer_id', 'name', 'email']
}
# Create the required_columns list with 'customer_id' and 'date'
# Your code here

Hint

Use a list with the exact column names required.

Check each source for required columns

Use a for loop with variables source and columns to iterate over raw_data_sources.items(). Inside the loop, create a variable has_all_required that is True if all required_columns are in columns, otherwise False. Store the results in a dictionary called contract_check with source names as keys and has_all_required as values.

Python

raw_data_sources = {
    'sales_data': ['order_id', 'customer_id', 'amount', 'date'],
    'inventory_data': ['product_id', 'stock', 'warehouse'],
    'customer_data': ['customer_id', 'name', 'email']
}

required_columns = ['customer_id', 'date']

# Create an empty dictionary contract_check
# Use a for loop to check if each source has all required columns
# Your code here

Hint

Use all() to check if every required column is in the source columns.
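To see how the all() check behaves before applying it to every source, here is a minimal sketch on two small column lists. The helper name meets_contract and the sample lists orders_columns and events_columns are hypothetical and are not part of the exercise.

```python
def meets_contract(columns, required):
    """Return True when every required column appears in columns."""
    return all(col in columns for col in required)


# Hypothetical column lists, used only for illustration
orders_columns = ['order_id', 'customer_id', 'date']
events_columns = ['event_id', 'date']
required = ['customer_id', 'date']

orders_ok = meets_contract(orders_columns, required)  # True: both columns present
events_ok = meets_contract(events_columns, required)  # False: 'customer_id' is missing

# An equivalent check using set membership instead of all()
orders_ok_set = set(required).issubset(orders_columns)

print(orders_ok, events_ok, orders_ok_set)
```

all() stops at the first missing column, so the check stays cheap even when a source has many columns.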

Print the contract check results

Write a print(contract_check) statement to display which sources meet the raw data contract.

Python

raw_data_sources = {
    'sales_data': ['order_id', 'customer_id', 'amount', 'date'],
    'inventory_data': ['product_id', 'stock', 'warehouse'],
    'customer_data': ['customer_id', 'name', 'email']
}

required_columns = ['customer_id', 'date']

contract_check = {}
for source, columns in raw_data_sources.items():
    has_all_required = all(col in columns for col in required_columns)
    contract_check[source] = has_all_required

# Print the contract_check dictionary
# Your code here

Hint

Use print(contract_check) to show the results.
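Putting the four steps together, the full check could look like the sketch below. It also records which required columns each source lacks; the missing_columns dictionary is an addition for illustration and is not required by the exercise.

```python
raw_data_sources = {
    'sales_data': ['order_id', 'customer_id', 'amount', 'date'],
    'inventory_data': ['product_id', 'stock', 'warehouse'],
    'customer_data': ['customer_id', 'name', 'email']
}

required_columns = ['customer_id', 'date']

contract_check = {}
missing_columns = {}  # illustrative extra: which required columns are absent
for source, columns in raw_data_sources.items():
    missing = [col for col in required_columns if col not in columns]
    missing_columns[source] = missing
    # A source meets the contract only when nothing is missing
    contract_check[source] = not missing

print(contract_check)
# {'sales_data': True, 'inventory_data': False, 'customer_data': False}

for source, missing in missing_columns.items():
    if missing:
        print(f"{source} is missing: {', '.join(missing)}")
```

Only sales_data contains both customer_id and date, so it is the only source that meets the contract. Listing the missing columns makes it easier to tell the team that owns a source exactly what to fix.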

Practice

(1/5)

1. Why do we define raw data contracts in dbt sources?

easy

A. To set clear expectations for the raw data coming into the system

B. To speed up the data loading process

C. To automatically fix data errors

D. To create visual reports from raw data

Solution

Step 1: Understand the purpose of raw data contracts

Step 2: Identify the main benefit in dbt context

Final Answer:

Quick Check:
