Why sources define raw data contracts in dbt - Why It Works This Way

Overview - Why sources define raw data contracts

What is it?

Raw data contracts are agreements that define the shape, quality, and expectations of data coming from a source system before it is processed. In dbt, a source definition declares the raw tables and columns a project depends on, and the tests attached to that definition check that incoming data meets those expectations. This helps teams catch errors early and maintain trust in their data. Essentially, it is a way to say, 'This is what the raw data should look like before we start working with it.'

Why it matters

Without raw data contracts, teams risk working with unexpected or broken data, which can cause errors downstream and lead to wrong decisions. Defining these contracts helps catch problems early, saving time and effort. It also creates clear communication between data producers and consumers, making data pipelines more reliable and easier to maintain. Without this, data teams would spend more time fixing issues than analyzing data.

Where it fits

Before learning about raw data contracts, you should understand basic data modeling and dbt sources. After this, you can learn about data testing, data quality frameworks, and advanced dbt features like snapshots and exposures. This topic sits at the start of the data transformation journey, focusing on input validation.

Mental Model

Core Idea

A raw data contract is a clear promise about what the incoming data looks like, so everyone knows what to expect before using it.

Think of it like...

It's like ordering a package online and agreeing with the seller on exactly what should be inside the box before it ships, so you can check it immediately when it arrives.

┌─────────────────────────────┐
│       Raw Data Source       │
│  (Data with expected shape) │
└─────────────┬───────────────┘
              │
              │ Defines contract:
              │ - Columns
              │ - Data types
              │ - Nullability
              │ - Constraints
              ▼
┌─────────────────────────────┐
│    dbt Source Definition    │
│  (Raw Data Contract Layer)  │
└─────────────┬───────────────┘
              │
              │ Validates data meets contract
              ▼
┌─────────────────────────────┐
│   Data Transformation in    │
│           dbt Models        │
└─────────────────────────────┘
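
As a minimal sketch of the contract layer in the diagram, a source definition might look like the following. The source name raw_data and the users table echo the examples used later on this page; the schema name, the email and status columns, and the allowed status values are hypothetical. The sketch assumes that data_type on a source column is descriptive metadata only, so nullability and constraints are expressed as tests.

```yaml
version: 2

sources:
  - name: raw_data              # referenced in models as source('raw_data', 'users')
    schema: raw                 # hypothetical schema where the loader writes
    tables:
      - name: users
        description: One row per registered user, loaded as-is from the app database.
        columns:
          - name: id
            data_type: integer  # documented expectation; no test here checks the type
            tests:
              - not_null        # nullability
              - unique          # constraint: no duplicate ids
          - name: email
            data_type: varchar
            tests:
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['active', 'inactive']  # hypothetical allowed values
```

Recent dbt versions also accept data_tests as the key in place of tests; the older spelling is kept here to match the snippets elsewhere on this page.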

Build-Up - 6 Steps

Foundation: Understanding raw data sources

Concept: Raw data sources are the original data tables or files before any changes or cleaning.

Raw data is like the untouched ingredients in a kitchen. It comes directly from systems like databases, logs, or APIs. This data can have missing values, unexpected formats, or errors. Knowing what raw data looks like is the first step to working with it safely.

Result

You can identify where your data starts and what its initial form is.

Understanding raw data sources helps you realize why checking data quality early is crucial.

Foundation: What is a data contract in dbt sources

Intermediate: How dbt tests enforce raw data contracts

Intermediate: Benefits of defining raw data contracts

Advanced: Handling contract changes and evolution

Expert: Raw data contracts in large-scale production systems

Under the Hood

When you define a source in dbt, dbt records the declared columns, descriptions, and attached tests as project metadata. When you run dbt test or dbt build, each test is compiled into a SQL query against the source table. If a query returns failing rows (for example, nulls in a column marked not_null), dbt reports a failure. Note that dbt run on its own does not execute tests, so source tests catch errors early only if they are run before the models that depend on them.
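
To make the mechanism concrete, the sketch below annotates two built-in tests with the approximate shape of the query each one compiles to. The SQL in the comments is a simplification for illustration, not dbt's exact compiled output, and raw.users is a hypothetical table location.

```yaml
sources:
  - name: raw_data
    schema: raw                 # hypothetical schema
    tables:
      - name: users
        columns:
          - name: id
            tests:
              # Roughly: select id from raw.users where id is null
              # Any returned row counts as a failure.
              - not_null
              # Roughly: select id from raw.users
              #          where id is not null
              #          group by id having count(*) > 1
              - unique
```

The design point is that a test passes when its query returns no rows, so every expectation is phrased as a search for violations.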

Why designed this way?

Raw data contracts were designed to formalize expectations between data producers and consumers, reducing guesswork and errors. Before contracts, teams relied on informal knowledge or manual checks, which caused delays and mistakes. The contract approach fits well with dbt's philosophy of version-controlled, test-driven data transformations. It balances flexibility with safety by allowing contracts to evolve but enforcing them automatically.

┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Raw Data Table│──────▶│ dbt Source    │──────▶│ dbt Tests Run │
│ (Production)  │       │ Definition    │       │ (SQL Queries) │
└───────────────┘       └───────────────┘       └───────────────┘
         │                      │                       │
         │ Data flows in        │ Contract defines      │ Tests validate
         │                      │ expected schema       │ data matches
         ▼                      ▼                       ▼
   ┌───────────────┐       ┌───────────────┐       ┌───────────────┐
   │ Data Pipeline │       │ Contract      │       │ Test Results  │
   │ Execution     │       │ Metadata      │       │ Pass/Fail     │
   └───────────────┘       └───────────────┘       └───────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Do you think raw data contracts fix all data quality issues automatically? Commit to yes or no.

Common Belief: Raw data contracts automatically fix any data problems in the source.

Quick: Do you think raw data contracts are only useful for big data teams? Commit to yes or no.

Common Belief: Only large teams or companies need raw data contracts.

Quick: Do you think raw data contracts prevent all downstream errors? Commit to yes or no.

Common Belief: If raw data contracts pass, downstream data is always correct.

Quick: Do you think raw data contracts are static and never change? Commit to yes or no.

Common Belief: Once defined, raw data contracts should never be updated.

Expert Zone

Raw data contracts can be extended with custom tests to capture domain-specific rules beyond basic schema checks.

Contracts serve as living documentation, so keeping them updated improves onboarding and cross-team collaboration.

In distributed teams, contracts act as formal SLAs (service-level agreements) between data producers and consumers.

When NOT to use

Raw data contracts are less useful when working with highly unstructured or rapidly changing data where schema enforcement is impractical. In such cases, schema-on-read approaches or data lakes with flexible schemas may be better. Also, for exploratory data analysis, strict contracts can slow down iteration.

Production Patterns

In production, teams integrate raw data contracts with CI/CD pipelines to run tests automatically on new data arrivals. Contracts are combined with data freshness checks and alerting systems. Version control of contracts allows rollback and audit trails. Contracts are also linked to data catalogs for governance and compliance.
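
A freshness check of the kind mentioned above can be declared next to the source itself. This is a sketch under assumptions: the _loaded_at column, the raw schema, and the 12 and 24 hour thresholds are hypothetical, and depending on your dbt version these keys may need to be nested under a config block, so check the documentation for the version you run.

```yaml
sources:
  - name: raw_data
    schema: raw                         # hypothetical schema
    loaded_at_field: _loaded_at         # hypothetical load-timestamp column
    freshness:
      warn_after: {count: 12, period: hour}    # warn if the newest row is older than 12 hours
      error_after: {count: 24, period: hour}   # error if it is older than 24 hours
    tables:
      - name: users
```

The dbt source freshness command evaluates these thresholds, which makes it a natural step to schedule alongside source tests in a CI/CD or orchestration job.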

Connections

API contracts

Similar pattern

Both raw data contracts and API contracts define clear expectations between producers and consumers to prevent integration errors.

Software unit testing

Builds-on

Raw data contracts are like unit tests for data inputs, ensuring each piece meets criteria before further processing.

Legal contracts

Conceptual analogy

Just as legal contracts set clear terms to avoid disputes, raw data contracts set clear data expectations to avoid pipeline failures.

Common Pitfalls

#1 Ignoring contract test failures and proceeding with transformations.

Wrong approach: dbt run --select my_model  # builds the model without running any source tests

Correct approach: dbt test --select source:my_source  # run the source tests first and fix failures before transforming

Root cause: Treating test failures as warnings rather than as gates on data quality.

#2 Defining contracts too loosely, allowing nulls or wrong types without checks.

Wrong approach:

    sources:
      - name: raw_data
        tables:
          - name: users
            columns:
              - name: id
                tests: []  # no tests defined

Correct approach:

    sources:
      - name: raw_data
        tables:
          - name: users
            columns:
              - name: id
                tests:
                  - not_null
                  - unique

Root cause: Underestimating the importance of strict contracts leads to undetected data issues.

#3 Not updating contracts when source schema changes.

Wrong approach: Leaving old source definitions in place after the source adds new columns or changes types.

Correct approach: Updating source definitions and tests promptly to reflect schema changes.

Root cause: Assuming contracts are a one-time setup rather than living documents.
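
One way to evolve a contract without breaking the pipeline is to add the new column with a relaxed severity first and tighten it later. The following is a sketch only: signup_channel is a hypothetical column assumed to have been added upstream, and the choice of warn is an illustration, not a recommendation from the original material.

```yaml
sources:
  - name: raw_data
    tables:
      - name: users
        columns:
          - name: id
            tests:
              - not_null
              - unique
          - name: signup_channel        # hypothetical column newly added upstream
            tests:
              - not_null:
                  config:
                    severity: warn      # report failures without failing the run during backfill
```

Once the column is reliably populated, removing the severity override returns the test to its default behavior of failing on any violation.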

Key Takeaways

Raw data contracts define clear expectations for incoming data to ensure quality and consistency.

They act as early warning systems by validating raw data before transformations run.

Contracts improve communication between data producers and consumers, reducing errors and delays.

Maintaining and evolving contracts is essential to keep data pipelines reliable as sources change.

In production, contracts are part of a broader data quality and governance strategy, not a standalone solution.

Practice

1. Why do we define raw data contracts in dbt sources?

easy

A. To set clear expectations for the raw data coming into the system

B. To speed up the data loading process

C. To automatically fix data errors

D. To create visual reports from raw data

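Solution

Step 1: Understand the purpose of raw data contracts. They state what the raw data should look like before any transformation begins.

Step 2: Identify the main benefit in dbt context. Source definitions and their tests catch unexpected data early; they do not load data faster, repair errors, or build reports.

Final Answer: A. To set clear expectations for the raw data coming into the system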
