Bird
Raised Fist0
dbtdata~20 mins

Why sources define raw data contracts in dbt - Challenge Your Understanding

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Challenge - 5 Problems
🎖️
Raw Data Contract Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
🧠 Conceptual
intermediate
2:00remaining
Purpose of Raw Data Contracts in Sources

Why do data teams define raw data contracts when setting up sources in dbt?

ATo replace the need for data testing in later stages
BTo speed up the data loading process by skipping validations
CTo automatically generate visualizations from raw data
DTo ensure the incoming data meets expected formats and quality before transformations
Attempts:
2 left
💡 Hint

Think about how contracts help keep data reliable and consistent.

Predict Output
intermediate
2:00remaining
Output of dbt Source Freshness Check

Given this dbt source freshness configuration, what will be the output status if the source data was last updated 3 hours ago?

sources:
  - name: sales_data
    freshness:
      warn_after:
        count: 2
        period: hour
      error_after:
        count: 4
        period: hour
AStatus: warn (data is older than 2 hours but less than 4 hours)
BStatus: pass (data is fresh within 2 hours)
CStatus: error (data is older than 4 hours)
DStatus: unknown (freshness not configured properly)
Attempts:
2 left
💡 Hint

Compare the last update time with the warn and error thresholds.

data_output
advanced
2:00remaining
Result of Source Schema Validation

Consider a source defined with a raw data contract expecting columns id (integer) and date (date). If the actual source data has an extra column status (string), what will be the result of the schema validation in dbt?

AValidation fails because <code>status</code> column type is incorrect
BValidation fails due to unexpected extra column
CValidation passes because extra columns are allowed by default
DValidation passes only if <code>status</code> column is nullable
Attempts:
2 left
💡 Hint

Think about whether dbt schema tests reject extra columns by default.

🔧 Debug
advanced
2:00remaining
Debugging a Failed Source Contract Test

A dbt source test for a raw data contract fails with the error: Column 'user_id' contains null values. The contract expects user_id to be non-nullable. Which option correctly fixes the issue?

ARemove the <code>user_id</code> column from the contract
BUpdate the source data to remove or fill nulls in <code>user_id</code>
CChange the contract to allow <code>user_id</code> to be nullable
DIgnore the test failure and proceed with transformations
Attempts:
2 left
💡 Hint

Consider data quality versus contract expectations.

🚀 Application
expert
3:00remaining
Designing a Raw Data Contract for a New Source

You are adding a new source table orders to your dbt project. The raw data contract must ensure the following:

  • order_id is unique and non-null
  • order_date is non-null and recent (within last 30 days)
  • customer_id is non-null

Which dbt source test configurations correctly enforce these rules?

AUnique test on <code>order_id</code>, not_null tests on <code>order_id</code>, <code>order_date</code>, <code>customer_id</code>, and a custom freshness test on <code>order_date</code>
BUnique test on <code>order_id</code> and <code>customer_id</code>, no freshness test
CNot_null tests on all columns and a unique test on <code>customer_id</code>
DOnly a freshness test on <code>order_date</code> and not_null on <code>order_id</code>
Attempts:
2 left
💡 Hint

Think about which tests enforce uniqueness, non-null, and freshness.

Practice

(1/5)
1. Why do we define raw data contracts in dbt sources?
easy
A. To set clear expectations for the raw data coming into the system
B. To speed up the data loading process
C. To automatically fix data errors
D. To create visual reports from raw data

Solution

  1. Step 1: Understand the purpose of raw data contracts

    Raw data contracts define what the incoming data should look like, such as expected columns and types.
  2. Step 2: Identify the main benefit in dbt context

    They help teams know what to expect and catch issues early, not speed up loading or fix errors automatically.
  3. Final Answer:

    To set clear expectations for the raw data coming into the system -> Option A
  4. Quick Check:

    Raw data contracts = clear data expectations [OK]
Hint: Raw data contracts = clear data rules for sources [OK]
Common Mistakes:
  • Thinking contracts speed up data loading
  • Assuming contracts fix data automatically
  • Confusing contracts with reporting tools
2. Which of the following is the correct way to define a source in a dbt YAML file for raw data contracts?
easy
A. source: name: raw_data table: users columns: - id tests: [not_null, unique]
B. sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique]
C. sources: raw_data: users: columns: - id tests: [not_null, unique]
D. source: - raw_data - users - columns: - id tests: [not_null, unique]

Solution

  1. Step 1: Recall dbt source YAML structure

    Sources are defined under sources: as a list with name and tables keys.
  2. Step 2: Match correct indentation and keys

    sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique] correctly uses sources list, name, tables, and columns with tests.
  3. Final Answer:

    sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique] -> Option B
  4. Quick Check:

    dbt source YAML = list with name, tables, columns [OK]
Hint: Sources use list with name and tables keys in YAML [OK]
Common Mistakes:
  • Using singular 'source' instead of 'sources'
  • Incorrect indentation or missing keys
  • Listing columns without proper nesting
3. Given this source definition in dbt YAML:
sources:
  - name: raw_sales
    tables:
      - name: transactions
        columns:
          - name: transaction_id
            tests: [not_null, unique]
          - name: amount
            tests: [not_null]
What happens if a transaction has a null amount when running dbt tests?
medium
A. The test will pass because nulls are allowed by default
B. The test will be skipped for the 'amount' column
C. dbt will automatically fill null amounts with zero
D. The test will fail, alerting that a null value exists in 'amount'

Solution

  1. Step 1: Understand the 'not_null' test in dbt

    The 'not_null' test checks that no null values exist in the specified column.
  2. Step 2: Predict test behavior on null data

    If a null value exists in 'amount', the 'not_null' test will fail and alert the user.
  3. Final Answer:

    The test will fail, alerting that a null value exists in 'amount' -> Option D
  4. Quick Check:

    'not_null' test fails on nulls [OK]
Hint: 'not_null' test fails if any nulls found [OK]
Common Mistakes:
  • Thinking dbt fills nulls automatically
  • Assuming tests pass by default
  • Believing tests skip columns with nulls
4. You wrote this source YAML in dbt:
sources:
  - name: raw_data
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests: not_null, unique
When running dbt, you get a syntax error. What is the problem?
medium
A. The tests list should be inside square brackets [ ]
B. The 'columns' key should be 'column'
C. The 'tables' key should be a dictionary, not a list
D. The source name cannot be 'raw_data'

Solution

  1. Step 1: Check YAML syntax for tests

    Tests must be listed as a YAML list inside square brackets or as a list with dashes.
  2. Step 2: Identify the error in tests format

    Writing tests: not_null, unique is invalid YAML; it should be tests: [not_null, unique].
  3. Final Answer:

    The tests list should be inside square brackets [ ] -> Option A
  4. Quick Check:

    Tests need brackets or dashes in YAML [OK]
Hint: Tests must be a list with brackets or dashes [OK]
Common Mistakes:
  • Writing tests as comma-separated string without brackets
  • Using wrong key names like 'column' instead of 'columns'
  • Misunderstanding list vs dictionary in YAML
5. You want to ensure your raw data source in dbt matches a strict contract: every 'order_id' must be unique and not null, and 'order_date' must be present and in date format. How should you define this in your source YAML to catch issues early?
hard
A. Define the source with columns and add tests only for 'order_id' as unique, ignoring 'order_date'
B. Define the source with columns but no tests; rely on downstream models to catch errors
C. Define the source with columns 'order_id' and 'order_date' and add tests: 'order_id' with [not_null, unique], 'order_date' with [not_null, accepted_values] for dates
D. Define the source with columns and add tests for 'order_date' only, ignoring 'order_id'

Solution

  1. Step 1: Identify required tests for 'order_id'

    To ensure uniqueness and no nulls, use tests [not_null, unique] on 'order_id'.
  2. Step 2: Define tests for 'order_date'

    To ensure presence and valid dates, use [not_null] and a test like 'accepted_values' or a custom test for date format.
  3. Step 3: Combine tests in source YAML

    Include both columns with their respective tests to catch issues early at the source level.
  4. Final Answer:

    Define the source with columns 'order_id' and 'order_date' and add tests: 'order_id' with [not_null, unique], 'order_date' with [not_null, accepted_values] for dates -> Option C
  5. Quick Check:

    Raw data contracts include all critical tests [OK]
Hint: Test all critical columns with not_null and uniqueness [OK]
Common Mistakes:
  • Skipping tests on important columns
  • Relying on downstream models for raw data validation
  • Not testing data formats like dates