Why sources define raw data contracts in dbt - Quick Recap

Recall & Review
beginner
What is a raw data contract in the context of data sources?
A raw data contract is an agreement or set of rules that defines the expected structure, format, and quality of raw data coming from a source before it is processed or transformed.
beginner
Why do data teams define raw data contracts for sources?
To ensure data consistency and reliability, and to catch errors early, by setting clear expectations about the data's format and quality before it enters the transformation pipeline.
intermediate
How do raw data contracts help in data transformation workflows?
They act as a checkpoint that validates incoming data, preventing bad or unexpected data from causing errors downstream in transformations or analyses.
intermediate
What might happen if raw data contracts are not defined for sources?
Without raw data contracts, data inconsistencies or errors can go unnoticed, leading to incorrect analysis, broken pipelines, and loss of trust in data.
beginner
Give an example of a rule that might be included in a raw data contract.
An example rule could be: "The column 'user_id' must always be a non-null integer," ensuring that every record has a valid user identifier.
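
In dbt, a rule like this is usually written as a test on a source column. The following is a minimal sketch, assuming a hypothetical source named `raw_app` with a `users` table (neither name comes from the original). The built-in `not_null` test covers only the non-null half of the rule; checking that the column is an integer would need an additional test.

```yaml
version: 2

sources:
  - name: raw_app          # hypothetical source name, for illustration only
    tables:
      - name: users        # hypothetical table name, for illustration only
        columns:
          - name: user_id
            tests:
              - not_null   # fails if any row has a null user_id
```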
What is the main purpose of defining raw data contracts for sources?
A. To create visualizations directly from raw data
B. To speed up data transformation by skipping validation
C. To store data in a compressed format
D. To ensure data meets expected structure and quality before processing
Which of the following is NOT a benefit of raw data contracts?
A. Improved data reliability
B. Early detection of data errors
C. Automatic data visualization
D. Clear communication of data expectations
What could happen if raw data contracts are missing?
A. Data will automatically be corrected
B. Data pipelines may break due to unexpected data
C. Data will be faster to process without checks
D. Data will be encrypted
A raw data contract might specify that a column must be:
A. Non-null and of a specific data type
B. Encrypted
C. Randomly generated
D. Always null
In dbt, why is defining raw data contracts important before transformations?
A. To avoid errors and maintain trust in transformed data
B. To skip testing transformed data
C. To reduce storage costs
D. To create dashboards automatically
Explain why defining raw data contracts for sources is important in a data pipeline.
Think about what happens if bad data enters your system.
Describe what kind of rules might be included in a raw data contract.
Consider how you would check if data is 'correct' before using it.

      Practice

      1. Why do we define raw data contracts in dbt sources?
      easy
      A. To set clear expectations for the raw data coming into the system
      B. To speed up the data loading process
      C. To automatically fix data errors
      D. To create visual reports from raw data

      Solution

      1. Step 1: Understand the purpose of raw data contracts

        Raw data contracts define what the incoming data should look like, such as expected columns and types.
      2. Step 2: Identify the main benefit in dbt context

        They help teams know what to expect and catch issues early; they do not speed up loading or fix errors automatically.
      3. Final Answer:

        To set clear expectations for the raw data coming into the system -> Option A
      4. Quick Check:

        Raw data contracts = clear data expectations [OK]
      Hint: Raw data contracts = clear data rules for sources [OK]
      Common Mistakes:
      • Thinking contracts speed up data loading
      • Assuming contracts fix data automatically
      • Confusing contracts with reporting tools
      2. Which of the following is the correct way to define a source in a dbt YAML file for raw data contracts?
      easy
      A. source: name: raw_data table: users columns: - id tests: [not_null, unique]
      B. sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique]
      C. sources: raw_data: users: columns: - id tests: [not_null, unique]
      D. source: - raw_data - users - columns: - id tests: [not_null, unique]

      Solution

      1. Step 1: Recall dbt source YAML structure

        Sources are defined under the top-level sources: key as a list; each entry has a name and a tables list, and each table can list its columns with tests.
      2. Step 2: Match correct indentation and keys

        sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique] correctly uses sources list, name, tables, and columns with tests.
      3. Final Answer:

        sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique] -> Option B
      4. Quick Check:

        dbt source YAML = list with name, tables, columns [OK]
      Hint: Sources use list with name and tables keys in YAML [OK]
      Common Mistakes:
      • Using singular 'source' instead of 'sources'
      • Incorrect indentation or missing keys
      • Listing columns without proper nesting
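
Because the answer options are shown on a single line, the nesting can be hard to see. Here is Option B written out with indentation, as a sketch of how it would likely appear in a schema file. Recent dbt versions also accept `data_tests:` in place of `tests:`.

```yaml
sources:
  - name: raw_data                     # one entry in the sources list
    tables:
      - name: users                    # one entry in that source's tables list
        columns:
          - name: id                   # each column is a mapping with a name
            tests: [not_null, unique]  # tests are a list attached to the column
```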
      3. Given this source definition in dbt YAML:
      sources:
        - name: raw_sales
          tables:
            - name: transactions
              columns:
                - name: transaction_id
                  tests: [not_null, unique]
                - name: amount
                  tests: [not_null]
      
      What happens if a transaction has a null amount when running dbt tests?
      medium
      A. The test will pass because nulls are allowed by default
      B. The test will be skipped for the 'amount' column
      C. dbt will automatically fill null amounts with zero
      D. The test will fail, alerting that a null value exists in 'amount'

      Solution

      1. Step 1: Understand the 'not_null' test in dbt

        The 'not_null' test checks that no null values exist in the specified column.
      2. Step 2: Predict test behavior on null data

        If a null value exists in 'amount', the 'not_null' test will fail and alert the user.
      3. Final Answer:

        The test will fail, alerting that a null value exists in 'amount' -> Option D
      4. Quick Check:

        'not_null' test fails on nulls [OK]
      Hint: 'not_null' test fails if any nulls found [OK]
      Common Mistakes:
      • Thinking dbt fills nulls automatically
      • Assuming tests pass by default
      • Believing tests skip columns with nulls
      4. You wrote this source YAML in dbt:
      sources:
        - name: raw_data
          tables:
            - name: customers
              columns:
                - name: customer_id
                  tests: not_null, unique
      
      When running dbt, you get an error while it parses this file. What is the problem?
      medium
      A. The tests list should be inside square brackets [ ]
      B. The 'columns' key should be 'column'
      C. The 'tables' key should be a dictionary, not a list
      D. The source name cannot be 'raw_data'

      Solution

      1. Step 1: Check YAML syntax for tests

        Tests must be listed as a YAML list inside square brackets or as a list with dashes.
      2. Step 2: Identify the error in tests format

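        Writing tests: not_null, unique makes YAML read the value as a single string, 'not_null, unique', rather than a list of two tests, which dbt does not accept; it should be tests: [not_null, unique].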
      3. Final Answer:

        The tests list should be inside square brackets [ ] -> Option A
      4. Quick Check:

        Tests need brackets or dashes in YAML [OK]
      Hint: Tests must be a list with brackets or dashes [OK]
      Common Mistakes:
      • Writing tests as comma-separated string without brackets
      • Using wrong key names like 'column' instead of 'columns'
      • Misunderstanding list vs dictionary in YAML
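
The two list notations mentioned in the solution are equivalent in YAML. The sketch below reuses the `customers` source from the question and shows the dash (block) form, with the bracket (flow) form in a comment.

```yaml
sources:
  - name: raw_data
    tables:
      - name: customers
        columns:
          - name: customer_id
            # Block style: one test per line, each introduced by a dash.
            tests:
              - not_null
              - unique
            # Flow style, equivalent to the block above:
            #   tests: [not_null, unique]
```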
      5. You want to ensure your raw data source in dbt matches a strict contract: every 'order_id' must be unique and not null, and 'order_date' must be present and in date format. How should you define this in your source YAML to catch issues early?
      hard
      A. Define the source with columns and add tests only for 'order_id' as unique, ignoring 'order_date'
      B. Define the source with columns but no tests; rely on downstream models to catch errors
      C. Define the source with columns 'order_id' and 'order_date' and add tests: 'order_id' with [not_null, unique], 'order_date' with [not_null] plus a test that checks the date format
      D. Define the source with columns and add tests for 'order_date' only, ignoring 'order_id'

      Solution

      1. Step 1: Identify required tests for 'order_id'

        To ensure uniqueness and no nulls, use tests [not_null, unique] on 'order_id'.
      2. Step 2: Define tests for 'order_date'

        To ensure presence, use [not_null]. To ensure valid dates, add a custom generic test or a package-provided test for the date format; the built-in 'accepted_values' test only compares values against a fixed list, so it does not validate a format.
      3. Step 3: Combine tests in source YAML

        Include both columns with their respective tests to catch issues early at the source level.
      4. Final Answer:

        Define the source with columns 'order_id' and 'order_date' and add tests: 'order_id' with [not_null, unique], 'order_date' with [not_null] plus a test that checks the date format -> Option C
      5. Quick Check:

        Raw data contracts include all critical tests [OK]
      Hint: Test all critical columns with not_null and uniqueness [OK]
      Common Mistakes:
      • Skipping tests on important columns
      • Relying on downstream models for raw data validation
      • Not testing data formats like dates
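
One possible way to write this contract is sketched below. The source and table names (`raw_shop`, `orders`) are hypothetical. The date check uses a test from the dbt-expectations package, which is an assumption not found in the original and must be installed separately. The `accepted_values` test is not used because it only matches a column against a fixed list of values. The exact type string may differ between warehouses.

```yaml
sources:
  - name: raw_shop               # hypothetical source name
    tables:
      - name: orders             # hypothetical table name
        columns:
          - name: order_id
            tests:
              - not_null         # every order has an id
              - unique           # no id appears twice
          - name: order_date
            tests:
              - not_null         # every order has a date
              # Assumes the dbt-expectations package is installed.
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: date
```

If adding a package is not an option, a custom generic test defined in the project could fill the same role.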