
Why sources define raw data contracts in dbt - Why It Works This Way

Overview - Why sources define raw data contracts
What is it?
Raw data contracts are agreements that define the expected shape and quality of data coming from a source system before it is processed. In dbt, source definitions express these contracts: they declare the raw tables and columns, and the tests and freshness rules attached to them check that incoming data meets the agreed standard. This helps teams catch errors early and maintain trust in their data. In effect, a source says, 'This is what the raw data should look like before we start working with it.'
Why it matters
Without raw data contracts, teams risk working with unexpected or broken data, which can cause errors downstream and lead to wrong decisions. Defining these contracts helps catch problems early, saving time and effort. It also creates clear communication between data producers and consumers, making data pipelines more reliable and easier to maintain. Without contracts, teams tend to spend time debugging data issues that could have been caught at the source.
Where it fits
Before learning about raw data contracts, you should understand basic data modeling and dbt sources. After this, you can learn about data testing, data quality frameworks, and advanced dbt features like snapshots and exposures. This topic sits at the start of the data transformation journey, focusing on input validation.
Mental Model
Core Idea
A raw data contract is a clear promise about what the incoming data looks like, so everyone knows what to expect before using it.
Think of it like...
It's like ordering a package online and agreeing with the seller on exactly what should be inside the box before it ships, so you can check it immediately when it arrives.
┌─────────────────────────────┐
│       Raw Data Source       │
│  (Data with expected shape) │
└─────────────┬───────────────┘
              │
              │ Defines contract:
              │ - Columns
              │ - Data types
              │ - Nullability
              │ - Constraints
              ▼
┌─────────────────────────────┐
│    dbt Source Definition    │
│  (Raw Data Contract Layer)  │
└─────────────┬───────────────┘
              │
              │ Validates data meets contract
              ▼
┌─────────────────────────────┐
│   Data Transformation in    │
│           dbt Models        │
└─────────────────────────────┘
Build-Up - 6 Steps
Step 1 (Foundation): Understanding raw data sources
Concept: Raw data sources are the original data tables or files before any changes or cleaning.
Raw data is like the untouched ingredients in a kitchen. It comes directly from systems like databases, logs, or APIs. This data can have missing values, unexpected formats, or errors. Knowing what raw data looks like is the first step to working with it safely.
Result
You can identify where your data starts and what its initial form is.
Understanding raw data sources helps you realize why checking data quality early is crucial.
Step 2 (Foundation): What is a data contract in dbt sources
Concept: A data contract in dbt sources defines the expected structure and rules for raw data.
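In dbt, you declare sources in YAML, listing each raw table and its columns, optionally with descriptions and data types. Rules such as "this column is never null" are expressed as tests attached to those columns. Together, the declaration and its tests act as a contract that the raw data must follow. dbt's built-in contract configuration applies to models rather than sources, so declared column types on a source are documentation and the contract is only checked when its tests run. If the data breaks the contract, those tests alert you before you build on it.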
Result
You have a formal agreement on what raw data should look like.
Knowing that sources define contracts helps prevent surprises during data transformations.
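A minimal sketch of such a declaration is shown here. The source name raw_data, the table users, and the column id come from the examples later in this lesson; the schema name and the email and created_at columns are assumptions made up for illustration.

```yaml
version: 2

sources:
  - name: raw_data              # source name used in this lesson's examples
    schema: raw                 # hypothetical schema where the loader writes tables
    description: Tables loaded from the application database, untransformed.
    tables:
      - name: users
        description: One row per registered user.
        columns:
          - name: id
            description: Primary key assigned by the source system.
            data_type: integer  # documentation only; not checked against the warehouse
          - name: email         # hypothetical column
            description: Login email address.
            data_type: varchar
          - name: created_at    # hypothetical column
            description: Timestamp when the account was created.
            data_type: timestamp
```

Models then read the table through the source function, for example {{ source('raw_data', 'users') }}, rather than a hard-coded table name, which ties every downstream model to this declaration.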
Step 3 (Intermediate): How dbt tests enforce raw data contracts
🤔Before reading on: Do you think dbt tests run automatically or need manual triggers? Commit to your answer.
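Concept: dbt uses tests to check whether raw data meets contract rules such as no missing values, no duplicate keys, or only allowed values.
You attach tests such as not_null or unique to columns in your source definitions. Tests do not run on their own: they execute when you invoke dbt test (or dbt build), which queries the raw data against these rules. dbt ships four generic tests (not_null, unique, accepted_values, relationships); checking a column's data type requires a custom test or a package. If any test fails, the contract is broken and either the data or the contract needs fixing.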
Result
You get immediate feedback if raw data does not meet expectations.
Understanding how tests enforce contracts helps you catch data issues early and maintain pipeline health.
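The following sketch attaches built-in generic tests to a source. The status column and its allowed values are hypothetical. The tests key matches the spelling used in this lesson; newer dbt releases prefer data_tests as the key and may nest test arguments differently, so treat the exact layout as version-dependent.

```yaml
version: 2

sources:
  - name: raw_data
    tables:
      - name: users
        columns:
          - name: id
            tests:
              - not_null        # every row must have an id
              - unique          # no two rows may share an id
          - name: status        # hypothetical column
            tests:
              - accepted_values:
                  values: ['active', 'suspended', 'deleted']   # assumed value set
```

Running dbt test --select source:raw_data would then execute only the tests attached to this source.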
Step 4 (Intermediate): Benefits of defining raw data contracts
🤔Before reading on: Do you think raw data contracts mainly help data producers, consumers, or both? Commit to your answer.
Concept: Raw data contracts improve communication, reliability, and error detection between data producers and consumers.
By defining contracts, producers know what data quality is expected, and consumers trust the data they receive. This reduces back-and-forth debugging and speeds up data projects. It also helps automate monitoring and alerting for data issues.
Result
Teams work more efficiently with fewer data surprises.
Knowing the benefits motivates consistent use of raw data contracts in projects.
Step 5 (Advanced): Handling contract changes and evolution
🤔Before reading on: Should raw data contracts be rigid forever or evolve with data? Commit to your answer.
Concept: Raw data contracts must adapt as source data changes, requiring versioning and communication.
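When source systems update schemas or add columns, contracts need updating too. Teams should keep source definitions in version control and communicate changes to avoid breaking downstream models. dbt tests help detect when the data no longer matches the contract, while source freshness checks flag tables that have stopped receiving new data; freshness does not detect schema changes.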
Result
Data pipelines remain stable despite source changes.
Understanding contract evolution prevents pipeline failures and supports agile data development.
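One way to make drift visible, sketched under assumptions: the _loaded_at timestamp column, the thresholds, and the signup_channel column are all hypothetical. Freshness rules flag stale loads, and a warn-level test lets a newly added column join the contract without failing the pipeline while producers finish rolling it out. Recent dbt versions may expect the freshness settings under a config block instead.

```yaml
version: 2

sources:
  - name: raw_data
    loaded_at_field: _loaded_at          # hypothetical load-timestamp column
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: users
        columns:
          - name: signup_channel         # hypothetical column added by the producer
            tests:
              - not_null:
                  config:
                    severity: warn       # report violations without failing the run
```

The freshness rules are evaluated with dbt source freshness; once the new column is reliably populated, the severity override can be removed so violations fail as errors.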
Step 6 (Expert): Raw data contracts in large-scale production systems
🤔Before reading on: Do you think raw data contracts alone guarantee data quality in complex systems? Commit to your answer.
Concept: In large systems, raw data contracts are part of a broader data governance and quality strategy.
Contracts define expectations, but monitoring, alerting, lineage tracking, and automated remediation complement them. Teams integrate contracts with data catalogs and observability tools. Contracts also help in compliance and auditing by documenting data agreements.
Result
Robust, trustworthy data pipelines that scale and comply with regulations.
Knowing contracts are one piece of a bigger quality puzzle helps design better data systems.
Under the Hood
When you define a source, dbt records metadata about its tables, columns, and attached tests. When you run dbt test or dbt build, each test is compiled into a SQL query against the source table that selects the rows violating the rule; a not_null test, for example, selects the rows where the column is null. If the query returns any rows, dbt reports a failure. Note that dbt run alone does not execute tests. With dbt build, source tests run ahead of the models that depend on them, and those models are skipped when a test fails, so errors are caught before transformations use the bad data.
Why designed this way?
Raw data contracts were designed to formalize expectations between data producers and consumers, reducing guesswork and errors. Before contracts, teams relied on informal knowledge or manual checks, which caused delays and mistakes. The contract approach fits well with dbt's philosophy of version-controlled, test-driven data transformations. It balances flexibility with safety by allowing contracts to evolve but enforcing them automatically.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Raw Data Table│──────▶│ dbt Source    │──────▶│ dbt Tests Run │
│ (Production)  │       │ Definition    │       │ (SQL Queries) │
└───────────────┘       └───────────────┘       └───────────────┘
         │                      │                       │
         │ Data flows in        │ Contract defines      │ Tests validate
         │                      │ expected schema       │ data matches
         ▼                      ▼                       ▼
   ┌───────────────┐       ┌───────────────┐       ┌───────────────┐
   │ Data Pipeline │       │ Contract      │       │ Test Results  │
   │ Execution     │       │ Metadata      │       │ Pass/Fail     │
   └───────────────┘       └───────────────┘       └───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Do you think raw data contracts fix all data quality issues automatically? Commit to yes or no.
Common Belief: Raw data contracts automatically fix any data problems in the source.
Reality: Raw data contracts only define expectations and detect issues; they do not fix data automatically.
Why it matters: Believing contracts fix data leads to ignoring necessary data cleaning and remediation steps, causing persistent errors.
Quick: Do you think raw data contracts are only useful for big data teams? Commit to yes or no.
Common Belief: Only large teams or companies need raw data contracts.
Reality: Any team working with data benefits from contracts because they prevent errors and improve communication.
Why it matters: Small teams without contracts may waste time debugging avoidable data issues.
Quick: Do you think raw data contracts prevent all downstream errors? Commit to yes or no.
Common Belief: If raw data contracts pass, downstream data is always correct.
Reality: Contracts only check raw data shape and quality; errors can still occur in transformations or business logic.
Why it matters: Overreliance on contracts can cause missed errors later in the pipeline.
Quick: Do you think raw data contracts are static and never change? Commit to yes or no.
Common Belief: Once defined, raw data contracts should never be updated.
Reality: Contracts must evolve as source data changes to stay accurate and useful.
Why it matters: Ignoring contract updates leads to false errors or missed issues, breaking trust in the system.
Expert Zone
1. Raw data contracts can be extended with custom tests to capture domain-specific rules beyond basic schema checks.
2. Contracts serve as living documentation, so keeping them updated improves onboarding and cross-team collaboration.
3. In distributed teams, contracts act as formal SLAs (service-level agreements) between data producers and consumers.
When NOT to use
Raw data contracts are less useful when working with highly unstructured or rapidly changing data where schema enforcement is impractical. In such cases, schema-on-read approaches or data lakes with flexible schemas may be better. Also, for exploratory data analysis, strict contracts can slow down iteration.
Production Patterns
In production, teams run source tests automatically from CI/CD pipelines and scheduled jobs, for example after each data load. Contracts are combined with data freshness checks and alerting systems. Keeping contracts in version control allows rollback and provides an audit trail. Contracts are also linked to data catalogs for governance and compliance.
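As an illustration only, a scheduled CI job along these lines could run the source checks before building models. This sketch assumes GitHub Actions, a Postgres adapter, and a ci directory holding a profiles.yml; none of these come from the lesson, and warehouse credentials would still need to be supplied as secrets.

```yaml
# Hypothetical workflow file: .github/workflows/source-contracts.yml
name: source-contract-checks

on:
  pull_request:
  schedule:
    - cron: "0 * * * *"          # assumed hourly schedule

jobs:
  check-sources:
    runs-on: ubuntu-latest
    env:
      DBT_PROFILES_DIR: ./ci     # hypothetical directory containing a CI profiles.yml
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt
        run: pip install dbt-core dbt-postgres   # adapter choice is an assumption
      - name: Install dbt packages
        run: dbt deps
      - name: Check source freshness
        run: dbt source freshness
      - name: Test source contracts
        run: dbt test --select "source:*"
      - name: Build models only after source checks pass
        run: dbt build
```

Because each step must succeed before the next one starts, a stale source or a failed source test stops the job before any model is built.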
Connections
API contracts
Similar pattern
Both raw data contracts and API contracts define clear expectations between producers and consumers to prevent integration errors.
Software unit testing
Builds-on
Raw data contracts are like unit tests for data inputs, ensuring each piece meets criteria before further processing.
Legal contracts
Conceptual analogy
Just as legal contracts set clear terms to avoid disputes, raw data contracts set clear data expectations to avoid pipeline failures.
Common Pitfalls
#1 Ignoring contract test failures and proceeding with transformations.
Wrong approach: dbt run --select my_model # builds the model without running any source tests
Correct approach: dbt test --select source:my_source # run source tests first and fix failures before running models
Root cause: Treating tests as optional warnings rather than as gatekeepers for data quality.
#2 Defining contracts too loosely, allowing nulls or wrong types without checks.
Wrong approach:
sources:
  - name: raw_data
    tables:
      - name: users
        columns:
          - name: id
            tests: []  # no tests defined
Correct approach:
sources:
  - name: raw_data
    tables:
      - name: users
        columns:
          - name: id
            tests:
              - not_null
              - unique
Root cause: Underestimating the importance of strict contracts leads to undetected data issues.
#3 Not updating contracts when source schema changes.
Wrong approach: Leaving old source definitions after source adds new columns or changes types.
Correct approach: Updating source definitions and tests to reflect new schema changes promptly.
Root cause: Assuming contracts are one-time setup rather than living documents.
Key Takeaways
Raw data contracts define clear expectations for incoming data to ensure quality and consistency.
They act as early warning systems by validating raw data before transformations run.
Contracts improve communication between data producers and consumers, reducing errors and delays.
Maintaining and evolving contracts is essential to keep data pipelines reliable as sources change.
In production, contracts are part of a broader data quality and governance strategy, not a standalone solution.