source() function for raw tables in dbt - Time & Space Complexity
We want to understand how the time to run a dbt model that uses the source() function changes as the raw table grows. Specifically, how does data volume affect the work the warehouse must do when executing dbt's compiled query against a raw table?
Analyze the time complexity of the following dbt code snippet.
```sql
select *
from {{ source('raw_schema', 'raw_table') }}
where event_date >= '2024-01-01'
```
This code reads data from a raw table using source() and filters rows by date.
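For context, source() is a Jinja macro that compiles to a fully qualified table reference before the query runs. A sketch of the compiled SQL is below; the database and schema names are assumptions, since in practice they come from your sources.yml:

```sql
-- Hypothetical compiled output of the model above.
-- 'analytics_db' is a placeholder; actual names come from sources.yml.
select *
from analytics_db.raw_schema.raw_table
where event_date >= '2024-01-01'
```

The macro only resolves the table name; all the row-scanning work happens when the warehouse executes this compiled query.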
Identify the repeated operations, analogous to the loops, recursion, or array traversals you would look for in code.
- Primary operation: Scanning rows in the raw table to apply the date filter.
- How many times: Once for each row in the raw table.
As the number of rows in the raw table grows, the number of rows scanned grows at the same rate.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 row checks |
| 100 | About 100 row checks |
| 1000 | About 1000 row checks |
Pattern observation: The work grows directly with the number of rows; doubling rows doubles work.
Time Complexity: O(n)
This means the time to run grows linearly with the number of rows in the raw table.
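The table-scan behavior above can be mimicked in plain Python: every row is checked exactly once against the date filter, so the number of checks equals the number of rows. This is a sketch of the scan pattern, not how a warehouse is implemented:

```python
from datetime import date

def filter_by_date(rows, cutoff):
    """Mimic a full table scan: every row is compared to the cutoff once."""
    checks = 0
    kept = []
    for row in rows:  # one pass over all n rows -> O(n)
        checks += 1
        if row["event_date"] >= cutoff:
            kept.append(row)
    return kept, checks

for n in (10, 100, 1000):
    rows = [{"event_date": date(2024, 1, 1 + i % 28)} for i in range(n)]
    _, checks = filter_by_date(rows, date(2024, 1, 1))
    print(n, checks)  # checks == n: work grows in lockstep with table size
```

Doubling the input list doubles `checks`, matching the pattern in the table above.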
[X] Wrong: "Using source() is instant and does not depend on table size."
[OK] Correct: source() only resolves to the raw table's fully qualified name at compile time; the database must still scan rows at query time to apply the filter, so runtime grows with table size.
Understanding how data size affects query time is key for writing efficient dbt models and working with raw data sources.
What if we added an index on event_date? How would the time complexity change?
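One way to explore this question is with SQLite's query planner, which ships with Python's standard library. This sketch uses an in-memory table standing in for the raw table; with no index the planner reports a full SCAN, while after adding an index on event_date it can SEARCH a B-tree instead, roughly O(log n) to locate the first match plus O(k) for the k matching rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table raw_table (event_date text, payload text)")
conn.executemany(
    "insert into raw_table values (?, ?)",
    [(f"2024-01-{d:02d}", "x") for d in range(1, 29)],
)

query = "select * from raw_table where event_date >= '2024-01-01'"

# Without an index, SQLite must check every row: a full table SCAN.
plan = conn.execute(f"explain query plan {query}").fetchall()
print(plan)  # plan detail mentions a SCAN of raw_table

conn.execute("create index idx_event_date on raw_table (event_date)")

# With the index, the planner can SEARCH via the B-tree instead of
# scanning, so the filter no longer touches every row.
plan = conn.execute(f"explain query plan {query}").fetchall()
print(plan)  # plan detail mentions SEARCH ... USING INDEX idx_event_date
```

Note that columnar cloud warehouses used with dbt (Snowflake, BigQuery, Redshift) rely on partitioning, clustering, or pruning rather than B-tree indexes, but the underlying idea is the same: metadata lets the engine skip rows instead of scanning all of them.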