
Why sources define raw data contracts in dbt - Performance Analysis

Understanding Time Complexity

We want to understand how defining raw data contracts in sources affects the time it takes to process data in dbt.

Specifically, how does checking these contracts scale as data grows?

Scenario Under Consideration

Analyze the time complexity of this dbt source definition with raw data contracts.


  sources:
    - name: raw_data
      tables:
        - name: users
          columns:
            - name: id
              tests:
                - not_null
                - unique
            - name: email
              tests:
                - not_null
                - unique

This YAML defines a source whose raw data contracts assert that the id and email columns are not null and unique.
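dbt compiles each generic test into a SQL query that runs in the warehouse, so the sketch below is not dbt's actual implementation. It is only a Python model of the per-row work those two tests imply: a not_null check examines every row once, and a unique check examines every row once while tracking values it has already seen.

```python
def check_not_null(rows, column):
    """Return the rows where the column is None (the test fails if any exist)."""
    return [r for r in rows if r[column] is None]

def check_unique(rows, column):
    """Return values that appear more than once (the test fails if any exist)."""
    seen, dupes = set(), set()
    for r in rows:
        value = r[column]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes

# Hypothetical sample data standing in for the raw_data.users source table.
users = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "c@example.com"},
]

print(check_not_null(users, "email"))  # one row violates not_null
print(check_unique(users, "id"))       # id 2 violates unique
```

Both functions make a single pass over the rows, which is the behavior the complexity analysis below assumes.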

Identify Repeating Operations

Look at what repeats when dbt runs these tests.

  • Primary operation: Scanning each row in the source table to check constraints.
  • How many times: Once per test per column, over all rows.

How Execution Grows With Input

As the number of rows grows, the checks take longer because each row is examined.

  Input Size (n)    Approx. Operations
  10                ~40    (2 tests x 2 columns x 10 rows)
  100               ~400   (2 tests x 2 columns x 100 rows)
  1000              ~4000  (2 tests x 2 columns x 1000 rows)

Pattern observation: Operations grow roughly in direct proportion to the number of rows.
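The table above can be reproduced with a small operation counter. This is an assumed cost model (each test on each column scans every row once), not a measurement of dbt itself:

```python
def approx_operations(n_rows, n_columns=2, n_tests_per_column=2):
    """Assumed cost model: every test on every column scans all rows once."""
    ops = 0
    for _ in range(n_columns):
        for _ in range(n_tests_per_column):
            ops += n_rows  # one full scan of the source table
    return ops

for n in (10, 100, 1000):
    print(n, approx_operations(n))  # 40, 400, 4000: proportional to n
```

With the test and column counts fixed, the total is 4n, so doubling the rows doubles the operations.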

Final Time Complexity

Time Complexity: O(n)

This means the time to check raw data contracts grows linearly with the number of rows in the source.

Common Mistake

[X] Wrong: "Adding more columns with tests does not affect time much because tests run independently."

[OK] Correct: Each test on each column scans all rows, so more columns or tests multiply the work.
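The multiplicative effect is easy to confirm under the same assumed cost model, where tests, columns, and rows are independent factors:

```python
def total_scans(n_rows, n_columns, n_tests_per_column):
    # Every test on every column examines all rows, so the factors multiply.
    return n_tests_per_column * n_columns * n_rows

base = total_scans(1_000, 2, 2)     # 4,000 row examinations
doubled = total_scans(1_000, 4, 2)  # 8,000: doubling tested columns doubles the work
print(base, doubled)
```

So adding tested columns does not change the O(n) shape, but it does scale the constant factor, which matters for real test runtimes.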

Interview Connect

Understanding how data validation scales helps you design efficient data pipelines and write better dbt models.

Self-Check

What if we added a test that compares values between two columns? How would the time complexity change?