Why sources define raw data contracts in dbt - Visual Breakdown

Concept Flow - Why sources define raw data contracts
Raw Data Source
-> Define Raw Data Contract
-> Set Expectations: Schema, Types, Quality
-> Data Consumers Use Contract
-> Detect Changes or Errors Early
-> Maintain Data Reliability
This flow shows how defining raw data contracts sets clear rules for raw data, helping users trust and use data safely.
Execution Sample
sources:
  - name: raw_sales
    tables:
      - name: transactions
        freshness:
          warn_after:
            count: 24
            period: hour
This dbt source config defines a raw data contract for the 'transactions' table: dbt warns when the most recent data is more than 24 hours old.
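The concept flow also mentions schema and quality expectations, which the sample above does not cover. The following is a minimal sketch of a fuller contract, not a definitive setup. The column name `_loaded_at`, the 48-hour `error_after` threshold, and the `transaction_id` tests are assumptions added for illustration. On many setups dbt needs a `loaded_at_field` to compute freshness at all.

```yaml
sources:
  - name: raw_sales
    tables:
      - name: transactions
        # Assumed timestamp column recording when each row was loaded
        loaded_at_field: _loaded_at
        freshness:
          # Warn when the newest row is more than 24 hours old
          warn_after: {count: 24, period: hour}
          # Assumed threshold: report an error when it is more than 48 hours old
          error_after: {count: 48, period: hour}
        columns:
          - name: transaction_id
            # Quality expectations checked when tests run
            tests: [not_null, unique]
```

Freshness rules are evaluated when `dbt source freshness` runs, and column tests when `dbt test` runs, so the contract only catches problems if those commands are part of the pipeline.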
Execution Table
Step | Action | Evaluation | Result
1 | Define source 'raw_sales.transactions' | Set schema and freshness rules | Contract established for raw data
2 | Data pipeline loads raw data | Check data against contract | Data matches schema and freshness
3 | Data consumer queries source | Uses contract to trust data | Reliable data used in models
4 | Raw data changes unexpectedly | Contract detects schema or freshness violation | Alert triggered for investigation
5 | Fix data or update contract | Restore contract compliance | Data reliability maintained
6 | End | All checks passed or issues resolved | Data pipeline stable
💡 Execution stops when the data contract is met, or when violations have been detected and addressed
Variable Tracker
Variable | Start | After Step 2 | After Step 4 | Final
Data Contract Status | Not defined | Active and valid | Violation detected | Restored or updated
Key Moments - 2 Insights
Why do we define a raw data contract instead of just trusting the raw data?
Defining a contract sets explicit expectations for schema and freshness, so unexpected changes or errors are caught early, as shown in step 4 of the execution table, rather than surfacing later in downstream models.
What happens if the raw data changes but the contract is not updated?
The checks attached to the source (freshness rules and tests) warn or fail the next time they run (step 4), so the problem can be investigated before unreliable data is used downstream.
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table. What is the result after step 3?
A. Data consumer uses unreliable data
B. Contract is violated
C. Data consumer uses reliable data
D. Data pipeline stops
💡 Hint
Check the 'Result' column in the row for step 3 in the execution table
At which step does the contract detect a violation?
A. Step 4
B. Step 2
C. Step 3
D. Step 5
💡 Hint
Look for the violation in the 'Evaluation' column of the execution table
If the contract was never defined, what would be the status at 'After Step 2' in the Variable Tracker?
A. Violation detected
B. Not defined
C. Active and valid
D. Restored or updated
💡 Hint
Refer to the 'Data Contract Status' row in the Variable Tracker for the initial state
Concept Snapshot
Define raw data contracts in dbt sources to set clear rules on schema and freshness.
Contracts help detect data issues early.
They help data consumers trust raw data.
Violations trigger alerts for quick fixes.
Together, these checks maintain overall data reliability.
Full Transcript
In dbt, defining raw data contracts means setting clear expectations for raw data sources, such as schema and freshness rules. This helps catch any unexpected changes or errors early. The flow starts with defining the contract, then loading data, checking it against the contract, and using it reliably downstream. If data changes unexpectedly, the contract detects violations and triggers alerts. Fixing data or updating the contract restores reliability. This process ensures data consumers can trust raw data and maintain stable pipelines.

Practice

1. Why do we define raw data contracts in dbt sources?
easy
A. To set clear expectations for the raw data coming into the system
B. To speed up the data loading process
C. To automatically fix data errors
D. To create visual reports from raw data

Solution

  1. Step 1: Understand the purpose of raw data contracts

    Raw data contracts define what the incoming data should look like, such as expected columns and types.
  2. Step 2: Identify the main benefit in dbt context

    They help teams know what to expect and catch issues early; they do not speed up loading or fix errors automatically.
  3. Final Answer:

    To set clear expectations for the raw data coming into the system -> Option A
  4. Quick Check:

    Raw data contracts = clear data expectations [OK]
Hint: Raw data contracts = clear data rules for sources [OK]
Common Mistakes:
  • Thinking contracts speed up data loading
  • Assuming contracts fix data automatically
  • Confusing contracts with reporting tools
2. Which of the following is the correct way to define a source in a dbt YAML file for raw data contracts?
easy
A. source: name: raw_data table: users columns: - id tests: [not_null, unique]
B. sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique]
C. sources: raw_data: users: columns: - id tests: [not_null, unique]
D. source: - raw_data - users - columns: - id tests: [not_null, unique]

Solution

  1. Step 1: Recall dbt source YAML structure

    Sources are defined under sources: as a list with name and tables keys.
  2. Step 2: Match correct indentation and keys

    Option B is the only choice that uses the sources: list, gives each entry name and tables keys, and nests columns (each with name and tests) under the table.
  3. Final Answer:

    sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique] -> Option B
  4. Quick Check:

    dbt source YAML = list with name, tables, columns [OK]
Hint: Sources use list with name and tables keys in YAML [OK]
Common Mistakes:
  • Using singular 'source' instead of 'sources'
  • Incorrect indentation or missing keys
  • Listing columns without proper nesting
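The answer options above are flattened onto single lines, which hides the nesting that the question is really about. Laid out with indentation, Option B would look roughly like this (a sketch of the structure only, using the names from the question):

```yaml
sources:
  - name: raw_data            # one entry in the sources list
    tables:
      - name: users           # one entry in that source's tables list
        columns:
          - name: id          # each column is an item with a name key
            tests: [not_null, unique]
```

Recent dbt versions (1.8 and later) also accept `data_tests:` in place of `tests:`; the nesting is the same either way.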
3. Given this source definition in dbt YAML:
sources:
  - name: raw_sales
    tables:
      - name: transactions
        columns:
          - name: transaction_id
            tests: [not_null, unique]
          - name: amount
            tests: [not_null]
What happens if a transaction has a null amount when running dbt tests?
medium
A. The test will pass because nulls are allowed by default
B. The test will be skipped for the 'amount' column
C. dbt will automatically fill null amounts with zero
D. The test will fail, alerting that a null value exists in 'amount'

Solution

  1. Step 1: Understand the 'not_null' test in dbt

    The 'not_null' test checks that no null values exist in the specified column.
  2. Step 2: Predict test behavior on null data

    If a null value exists in 'amount', the 'not_null' test will fail and alert the user.
  3. Final Answer:

    The test will fail, alerting that a null value exists in 'amount' -> Option D
  4. Quick Check:

    'not_null' test fails on nulls [OK]
Hint: 'not_null' test fails if any nulls found [OK]
Common Mistakes:
  • Thinking dbt fills nulls automatically
  • Assuming tests pass by default
  • Believing tests skip columns with nulls
4. You wrote this source YAML in dbt:
sources:
  - name: raw_data
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests: not_null, unique
When running dbt, you get an error while this file is being parsed. What is the problem?
medium
A. The tests list should be inside square brackets [ ]
B. The 'columns' key should be 'column'
C. The 'tables' key should be a dictionary, not a list
D. The source name cannot be 'raw_data'

Solution

  1. Step 1: Check YAML syntax for tests

    Tests must be listed as a YAML list inside square brackets or as a list with dashes.
  2. Step 2: Identify the error in tests format

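    Writing tests: not_null, unique does not produce a list: YAML reads it as the single string "not_null, unique", which dbt rejects because it expects a list of tests. It should be tests: [not_null, unique].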
  3. Final Answer:

    The tests list should be inside square brackets [ ] -> Option A
  4. Quick Check:

    Tests need brackets or dashes in YAML [OK]
Hint: Tests must be a list with brackets or dashes [OK]
Common Mistakes:
  • Writing tests as comma-separated string without brackets
  • Using wrong key names like 'column' instead of 'columns'
  • Misunderstanding list vs dictionary in YAML
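The fix can be written in either of the two list styles mentioned in the solution. This sketch uses the dash (block) style and notes the bracket (flow) style in a comment; both describe the same two tests.

```yaml
sources:
  - name: raw_data
    tables:
      - name: customers
        columns:
          - name: customer_id
            # Block style: one test per dash.
            # Equivalent flow style: tests: [not_null, unique]
            tests:
              - not_null
              - unique
```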
5. You want to ensure your raw data source in dbt matches a strict contract: every 'order_id' must be unique and not null, and 'order_date' must be present and in date format. How should you define this in your source YAML to catch issues early?
hard
A. Define the source with columns and add tests only for 'order_id' as unique, ignoring 'order_date'
B. Define the source with columns but no tests; rely on downstream models to catch errors
C. Define the source with columns 'order_id' and 'order_date' and add tests: 'order_id' with [not_null, unique], 'order_date' with not_null plus a custom or package test that checks the date format
D. Define the source with columns and add tests for 'order_date' only, ignoring 'order_id'

Solution

  1. Step 1: Identify required tests for 'order_id'

    To ensure uniqueness and no nulls, use tests [not_null, unique] on 'order_id'.
  2. Step 2: Define tests for 'order_date'

    To ensure presence, use not_null. To ensure valid dates, add a custom generic test or a package test for the date format; the built-in accepted_values test only compares values against a fixed list, so it cannot validate a format.
  3. Step 3: Combine tests in source YAML

    Include both columns with their respective tests to catch issues early at the source level.
  4. Final Answer:

    Define the source with columns 'order_id' and 'order_date' and add tests: 'order_id' with [not_null, unique], 'order_date' with not_null plus a custom or package test that checks the date format -> Option C
  5. Quick Check:

    Raw data contracts include all critical tests [OK]
Hint: Test all critical columns with not_null and uniqueness [OK]
Common Mistakes:
  • Skipping tests on important columns
  • Relying on downstream models for raw data validation
  • Not testing data formats like dates
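A strict contract along the lines of question 5 might be sketched as follows. The source name `raw_orders`, the table name `orders`, and the test name `is_valid_date` are all hypothetical; `is_valid_date` stands in for a custom generic test that the project would have to define itself (or replace with a test from an installed package), since dbt has no built-in date-format test.

```yaml
sources:
  - name: raw_orders                # hypothetical source name
    tables:
      - name: orders                # hypothetical table name
        columns:
          - name: order_id
            tests:
              - not_null
              - unique
          - name: order_date
            tests:
              - not_null
              - is_valid_date       # hypothetical custom generic test, not built into dbt
```

Keeping these checks on the source rather than on a downstream model means a bad load is reported against the raw table where it originated.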