
Why sources define raw data contracts in dbt - Why It Works This Way

Overview - Why sources define raw data contracts
What is it?
Raw data contracts are agreements that define the exact shape, quality, and expectations of data coming from a source system before it is processed. In dbt, sources define these contracts to ensure that the data entering the transformation pipeline meets certain standards. This helps teams catch errors early and maintain trust in their data. Essentially, it is a way to say, 'This is what the raw data should look like before we start working with it.'
Why it matters
Without raw data contracts, teams risk working with unexpected or broken data, which can cause errors downstream and lead to wrong decisions. Defining these contracts helps catch problems early, saving time and effort. It also creates clear communication between data producers and consumers, making data pipelines more reliable and easier to maintain. Without this, data teams would spend more time fixing issues than analyzing data.
Where it fits
Before learning about raw data contracts, you should understand basic data modeling and dbt sources. After this, you can learn about data testing, data quality frameworks, and advanced dbt features like snapshots and exposures. This topic sits at the start of the data transformation journey, focusing on input validation.
Mental Model
Core Idea
A raw data contract is a clear promise about what the incoming data looks like, so everyone knows what to expect before using it.
Think of it like...
It's like ordering a package online and agreeing with the seller on exactly what should be inside the box before it ships, so you can check it immediately when it arrives.
┌─────────────────────────────┐
│       Raw Data Source       │
│  (Data with expected shape) │
└─────────────┬───────────────┘
              │
              │ Defines contract:
              │ - Columns
              │ - Data types
              │ - Nullability
              │ - Constraints
              ▼
┌─────────────────────────────┐
│    dbt Source Definition    │
│  (Raw Data Contract Layer)  │
└─────────────┬───────────────┘
              │
              │ Validates data meets contract
              ▼
┌─────────────────────────────┐
│   Data Transformation in    │
│           dbt Models        │
└─────────────────────────────┘
Build-Up - 6 Steps
1. Foundation: Understanding raw data sources
Concept: Raw data sources are the original data tables or files before any changes or cleaning.
Raw data is like the untouched ingredients in a kitchen. It comes directly from systems like databases, logs, or APIs. This data can have missing values, unexpected formats, or errors. Knowing what raw data looks like is the first step to working with it safely.
Result
You can identify where your data starts and what its initial form is.
Understanding raw data sources helps you realize why checking data quality early is crucial.
2. Foundation: What is a data contract in dbt sources?
Concept: A data contract in dbt sources defines the expected structure and rules for raw data.
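In dbt, you declare sources in YAML with details such as table and column names, descriptions, and tests. Expectations like "this column is never null" are expressed as tests (for example, not_null) rather than as database constraints, and a data type declared on a source column serves as documentation unless you add a test that checks it. Together, the declaration and its tests act as a contract that the raw data must follow. If the data breaks this contract, dbt tests can alert you before transformations run.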
Result
You have a formal agreement on what raw data should look like.
Knowing that sources define contracts helps prevent surprises during data transformations.
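As a minimal sketch of such a declaration, the YAML below defines one source with a single table and two columns. The names raw_shop, orders, order_id, and status, along with the file path, are hypothetical and chosen only for illustration.

```yaml
# models/staging/sources.yml (hypothetical path)
version: 2

sources:
  - name: raw_shop            # hypothetical source name
    description: Tables loaded from the shop's production database.
    tables:
      - name: orders
        description: One row per customer order, as loaded.
        columns:
          - name: order_id
            description: Primary key of the order.
            tests:
              - not_null      # nullability is expressed as a test
              - unique
          - name: status
            description: Current order status.
```

Models would then read the table through {{ source('raw_shop', 'orders') }} instead of hard-coding its name. In dbt 1.8 and later, the tests: key can also be written as data_tests:.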
3. Intermediate: How dbt tests enforce raw data contracts
Before reading on: Do you think dbt tests run automatically or need manual triggers? Commit to your answer.
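Concept: dbt uses tests to check whether raw data meets contract rules such as no missing values or no duplicate keys.
You can add tests to your source definitions, such as 'not_null' or 'unique' on columns. Tests do not run on their own: they execute when you invoke dbt test or dbt build, whether by hand or from a scheduler or CI job. Each test checks the raw data against its rule, and a failure means the data contract is broken and needs fixing. The built-in tests cover nulls, uniqueness, accepted values, and relationships; checking data types or formats requires a custom or package-provided test.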
Result
You get immediate feedback if raw data does not meet expectations.
Understanding how tests enforce contracts helps you catch data issues early and maintain pipeline health.
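The four built-in generic tests can be combined on a source roughly as follows. This is a sketch under assumed names (raw_shop, customers, orders, and the status values are all hypothetical), using the long-standing argument layout; check the documentation for your dbt version if it reports a deprecation.

```yaml
version: 2

sources:
  - name: raw_shop                      # hypothetical names throughout
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - not_null
              - unique
      - name: orders
        columns:
          - name: customer_id
            tests:
              - not_null
              # every order must point at an existing customer
              - relationships:
                  to: source('raw_shop', 'customers')
                  field: customer_id
          - name: status
            tests:
              # only these values are allowed
              - accepted_values:
                  values: ['placed', 'shipped', 'returned']
                  config:
                    severity: warn      # report as a warning instead of an error
```

Running dbt test --select source:raw_shop would execute only these source tests. The severity setting is a design choice: error (the default) treats the rule as a gate, while warn only reports it.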
4. Intermediate: Benefits of defining raw data contracts
Before reading on: Do you think raw data contracts mainly help data producers, consumers, or both? Commit to your answer.
Concept: Raw data contracts improve communication, reliability, and error detection between data producers and consumers.
By defining contracts, producers know what data quality is expected, and consumers trust the data they receive. This reduces back-and-forth debugging and speeds up data projects. It also helps automate monitoring and alerting for data issues.
Result
Teams work more efficiently with fewer data surprises.
Knowing the benefits motivates consistent use of raw data contracts in projects.
5. Advanced: Handling contract changes and evolution
Before reading on: Should raw data contracts be rigid forever or evolve with data? Commit to your answer.
Concept: Raw data contracts must adapt as source data changes, requiring versioning and communication.
When source systems update schemas or add columns, contracts need updating too. Teams should version contracts and communicate changes to avoid breaking downstream models. Failing source tests signal that a contract no longer matches reality, while dbt's source freshness checks catch a related problem: data that has stopped arriving on schedule.
Result
Data pipelines remain stable despite source changes.
Understanding contract evolution prevents pipeline failures and supports agile data development.
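Source freshness is configured on the source itself. The sketch below assumes a hypothetical load-timestamp column named _loaded_at and arbitrary thresholds; it uses the long-standing placement of these keys, and newer dbt releases may prefer them under a config: block, so check the documentation for your version.

```yaml
version: 2

sources:
  - name: raw_shop                     # hypothetical source name
    loaded_at_field: _loaded_at        # assumed column recording load time
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
      - name: customers
```

The check runs with dbt source freshness. It compares the latest _loaded_at value with the current time, so it reports stale data rather than schema changes.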
6. Expert: Raw data contracts in large-scale production systems
Before reading on: Do you think raw data contracts alone guarantee data quality in complex systems? Commit to your answer.
Concept: In large systems, raw data contracts are part of a broader data governance and quality strategy.
Contracts define expectations, but monitoring, alerting, lineage tracking, and automated remediation complement them. Teams integrate contracts with data catalogs and observability tools. Contracts also help in compliance and auditing by documenting data agreements.
Result
Robust, trustworthy data pipelines that scale and comply with regulations.
Knowing contracts are one piece of a bigger quality puzzle helps design better data systems.
Under the Hood
When you define a source in dbt, dbt records the declared tables, columns, descriptions, and tests as project metadata. When you run dbt test or dbt build, each test is compiled into a SQL query against the source table that selects the rows violating the rule. If that query returns any rows (like nulls where none are allowed), dbt flags a failure by default. With dbt build, or when source tests are run before dbt run, this check happens ahead of the dependent transformations, so errors are caught early.
Why designed this way?
Raw data contracts were designed to formalize expectations between data producers and consumers, reducing guesswork and errors. Before contracts, teams relied on informal knowledge or manual checks, which caused delays and mistakes. The contract approach fits well with dbt's philosophy of version-controlled, test-driven data transformations. It balances flexibility with safety by allowing contracts to evolve but enforcing them automatically.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Raw Data Table│──────▶│ dbt Source    │──────▶│ dbt Tests Run │
│ (Production)  │       │ Definition    │       │ (SQL Queries) │
└───────────────┘       └───────────────┘       └───────────────┘
         │                      │                       │
         │ Data flows in        │ Contract defines      │ Tests validate
         │                      │ expected schema       │ data matches
         ▼                      ▼                       ▼
   ┌───────────────┐       ┌───────────────┐       ┌───────────────┐
   │ Data Pipeline │       │ Contract      │       │ Test Results  │
   │ Execution     │       │ Metadata      │       │ Pass/Fail     │
   └───────────────┘       └───────────────┘       └───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Do you think raw data contracts fix all data quality issues automatically? Commit to yes or no.
Common Belief: Raw data contracts automatically fix any data problems in the source.
Reality:Raw data contracts only define expectations and detect issues; they do not fix data automatically.
Why it matters:Believing contracts fix data leads to ignoring necessary data cleaning and remediation steps, causing persistent errors.
Quick: Do you think raw data contracts are only useful for big data teams? Commit to yes or no.
Common Belief: Only large teams or companies need raw data contracts.
Reality:Any team working with data benefits from contracts because they prevent errors and improve communication.
Why it matters:Small teams without contracts may waste time debugging avoidable data issues.
Quick: Do you think raw data contracts prevent all downstream errors? Commit to yes or no.
Common Belief: If raw data contracts pass, downstream data is always correct.
Reality:Contracts only check raw data shape and quality; errors can still occur in transformations or business logic.
Why it matters:Overreliance on contracts can cause missed errors later in the pipeline.
Quick: Do you think raw data contracts are static and never change? Commit to yes or no.
Common Belief: Once defined, raw data contracts should never be updated.
Reality:Contracts must evolve as source data changes to stay accurate and useful.
Why it matters:Ignoring contract updates leads to false errors or missed issues, breaking trust in the system.
Expert Zone
1. Raw data contracts can be extended with custom tests to capture domain-specific rules beyond basic schema checks.
2. Contracts serve as living documentation, so keeping them updated improves onboarding and cross-team collaboration.
3. In distributed teams, contracts act as formal SLAs (service-level agreements) between data producers and consumers.
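A domain-specific rule can often be expressed with a package-provided test before writing a custom one. The sketch below assumes the dbt_utils package is installed (listed in packages.yml and fetched with dbt deps) and reuses the raw_sales.transactions example; the rule that amounts are never negative is an assumption made for illustration.

```yaml
version: 2

sources:
  - name: raw_sales
    tables:
      - name: transactions
        columns:
          - name: amount
            tests:
              - not_null
              # assumed domain rule: amounts are never negative
              - dbt_utils.accepted_range:
                  min_value: 0
                  inclusive: true
```

Rules that no package covers can be written as custom generic tests in the project and referenced from the same tests: list.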
When NOT to use
Raw data contracts are less useful when working with highly unstructured or rapidly changing data where schema enforcement is impractical. In such cases, schema-on-read approaches or data lakes with flexible schemas may be better. Also, for exploratory data analysis, strict contracts can slow down iteration.
Production Patterns
In production, teams integrate raw data contracts with CI/CD pipelines and schedulers so that tests run automatically on code changes and after new data loads. Contracts are combined with data freshness checks and alerting systems. Version control of contracts allows rollback and audit trails. Contracts are also linked to data catalogs for governance and compliance.
Connections
API contracts
Similar pattern
Both raw data contracts and API contracts define clear expectations between producers and consumers to prevent integration errors.
Software unit testing
Builds-on
Raw data contracts are like unit tests for data inputs, ensuring each piece meets criteria before further processing.
Legal contracts
Conceptual analogy
Just as legal contracts set clear terms to avoid disputes, raw data contracts set clear data expectations to avoid pipeline failures.
Common Pitfalls
#1 Ignoring contract test failures and proceeding with transformations.
Wrong approach: dbt run --select my_model # runs despite source test failures
Correct approach: dbt test --select source:my_source # run tests first and fix failures before transformations
Root cause: Treating test failures as warnings rather than as gates on data quality.
#2 Defining contracts too loosely, allowing nulls or wrong types without checks.
Wrong approach:
sources:
  - name: raw_data
    tables:
      - name: users
        columns:
          - name: id
            tests: []  # no tests defined
Correct approach:
sources:
  - name: raw_data
    tables:
      - name: users
        columns:
          - name: id
            tests:
              - not_null
              - unique
Root cause: Underestimating the importance of strict contracts leads to undetected data issues.
#3 Not updating contracts when source schema changes.
Wrong approach: Leaving old source definitions after source adds new columns or changes types.
Correct approach: Updating source definitions and tests to reflect new schema changes promptly.
Root cause: Assuming contracts are one-time setup rather than living documents.
Key Takeaways
Raw data contracts define clear expectations for incoming data to ensure quality and consistency.
They act as early warning systems by validating raw data before transformations run.
Contracts improve communication between data producers and consumers, reducing errors and delays.
Maintaining and evolving contracts is essential to keep data pipelines reliable as sources change.
In production, contracts are part of a broader data quality and governance strategy, not a standalone solution.

Practice

1. Why do we define raw data contracts in dbt sources?
easy
A. To set clear expectations for the raw data coming into the system
B. To speed up the data loading process
C. To automatically fix data errors
D. To create visual reports from raw data

Solution

  1. Step 1: Understand the purpose of raw data contracts

    Raw data contracts define what the incoming data should look like, such as expected columns and types.
  2. Step 2: Identify the main benefit in dbt context

    They help teams know what to expect and catch issues early, not speed up loading or fix errors automatically.
  3. Final Answer:

    To set clear expectations for the raw data coming into the system -> Option A
  4. Quick Check:

    Raw data contracts = clear data expectations [OK]
Hint: Raw data contracts = clear data rules for sources [OK]
Common Mistakes:
  • Thinking contracts speed up data loading
  • Assuming contracts fix data automatically
  • Confusing contracts with reporting tools
2. Which of the following is the correct way to define a source in a dbt YAML file for raw data contracts?
easy
A. source: name: raw_data table: users columns: - id tests: [not_null, unique]
B. sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique]
C. sources: raw_data: users: columns: - id tests: [not_null, unique]
D. source: - raw_data - users - columns: - id tests: [not_null, unique]

Solution

  1. Step 1: Recall dbt source YAML structure

    Sources are defined under sources: as a list with name and tables keys.
  2. Step 2: Match correct indentation and keys

    Option B correctly uses a sources list whose entries have name and tables keys, with columns and their tests nested under each table.
  3. Final Answer:

    sources: - name: raw_data tables: - name: users columns: - name: id tests: [not_null, unique] -> Option B
  4. Quick Check:

    dbt source YAML = list with name, tables, columns [OK]
Hint: Sources use list with name and tables keys in YAML [OK]
Common Mistakes:
  • Using singular 'source' instead of 'sources'
  • Incorrect indentation or missing keys
  • Listing columns without proper nesting
3. Given this source definition in dbt YAML:
sources:
  - name: raw_sales
    tables:
      - name: transactions
        columns:
          - name: transaction_id
            tests: [not_null, unique]
          - name: amount
            tests: [not_null]
What happens if a transaction has a null amount when running dbt tests?
medium
A. The test will pass because nulls are allowed by default
B. The test will be skipped for the 'amount' column
C. dbt will automatically fill null amounts with zero
D. The test will fail, alerting that a null value exists in 'amount'

Solution

  1. Step 1: Understand the 'not_null' test in dbt

    The 'not_null' test checks that no null values exist in the specified column.
  2. Step 2: Predict test behavior on null data

    If a null value exists in 'amount', the 'not_null' test will fail and alert the user.
  3. Final Answer:

    The test will fail, alerting that a null value exists in 'amount' -> Option D
  4. Quick Check:

    'not_null' test fails on nulls [OK]
Hint: 'not_null' test fails if any nulls found [OK]
Common Mistakes:
  • Thinking dbt fills nulls automatically
  • Assuming tests pass by default
  • Believing tests skip columns with nulls
4. You wrote this source YAML in dbt:
sources:
  - name: raw_data
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests: not_null, unique
When running dbt, you get a parsing error. What is the problem?
medium
A. The tests list should be inside square brackets [ ]
B. The 'columns' key should be 'column'
C. The 'tables' key should be a dictionary, not a list
D. The source name cannot be 'raw_data'

Solution

  1. Step 1: Check YAML syntax for tests

    Tests must be listed as a YAML list inside square brackets or as a list with dashes.
  2. Step 2: Identify the error in tests format

    Writing tests: not_null, unique makes YAML read the value as the single string "not_null, unique" rather than a list, so dbt rejects it; it should be tests: [not_null, unique].
  3. Final Answer:

    The tests list should be inside square brackets [ ] -> Option A
  4. Quick Check:

    Tests need brackets or dashes in YAML [OK]
Hint: Tests must be a list with brackets or dashes [OK]
Common Mistakes:
  • Writing tests as comma-separated string without brackets
  • Using wrong key names like 'column' instead of 'columns'
  • Misunderstanding list vs dictionary in YAML
5. You want to ensure your raw data source in dbt matches a strict contract: every 'order_id' must be unique and not null, and 'order_date' must be present and in date format. How should you define this in your source YAML to catch issues early?
hard
A. Define the source with columns and add tests only for 'order_id' as unique, ignoring 'order_date'
B. Define the source with columns but no tests; rely on downstream models to catch errors
C. Define the source with columns 'order_id' and 'order_date' and add tests: 'order_id' with [not_null, unique], 'order_date' with [not_null] plus a custom or package test that checks the date format
D. Define the source with columns and add tests for 'order_date' only, ignoring 'order_id'

Solution

  1. Step 1: Identify required tests for 'order_id'

    To ensure uniqueness and no nulls, use tests [not_null, unique] on 'order_id'.
  2. Step 2: Define tests for 'order_date'

    To ensure presence, use [not_null]. To ensure valid dates, add a custom or package-provided test for the date format; the built-in 'accepted_values' test only compares values against a fixed list, so it does not fit here.
  3. Step 3: Combine tests in source YAML

    Include both columns with their respective tests to catch issues early at the source level.
  4. Final Answer:

    Define the source with columns 'order_id' and 'order_date' and add tests: 'order_id' with [not_null, unique], 'order_date' with [not_null] plus a custom or package test that checks the date format -> Option C
  5. Quick Check:

    Raw data contracts include all critical tests [OK]
Hint: Test all critical columns with not_null and uniqueness [OK]
Common Mistakes:
  • Skipping tests on important columns
  • Relying on downstream models for raw data validation
  • Not testing data formats like dates
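The contract from question 5 could be sketched as below. It assumes the community dbt_expectations package is installed and that the warehouse stores order_date in a column type named date; the source name raw_shop is hypothetical. The built-in accepted_values test compares against a fixed list of values, so a type test is used for the date instead.

```yaml
version: 2

sources:
  - name: raw_shop                 # hypothetical source name
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - not_null
              - unique
          - name: order_date
            tests:
              - not_null
              # assumes the dbt_expectations package is installed
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: date
```

Type names differ between warehouses, so the column_type value may need adjusting. If the raw column arrives as text, a format check such as a custom generic test would be the closer fit.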