Seeds for static reference data in dbt - Time & Space Complexity
We want to understand how the time to load static reference data with dbt seeds changes as the data grows: how does the number of rows in the seed CSV affect loading time? Let's analyze the time complexity of the seed setup below.
`seeds/my_reference_data.csv`:

```csv
id,name
1,Category A
2,Category B
3,Category C
```
`dbt_project.yml` (note: seeds are discovered automatically by file name from the project's `seeds/` directory, so there is no `file:` setting; per-seed config such as column types is optional):

```yaml
seeds:
  my_project:
    my_reference_data:
      +column_types:
        id: integer
```
Usage in a model:

```sql
select * from {{ ref('my_reference_data') }}
```
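Conceptually, loading a seed is a read-and-insert pass over the CSV. Here is a minimal Python sketch using SQLite as a stand-in warehouse; it models the row-by-row behavior we analyze below, not dbt's actual implementation (dbt parses the CSV through its adapter and batches inserts):

```python
import csv
import io
import sqlite3

# Stand-in contents for seeds/my_reference_data.csv.
SEED_CSV = """id,name
1,Category A
2,Category B
3,Category C
"""

def load_seed(conn: sqlite3.Connection, table: str, csv_text: str) -> int:
    """Simplified model of `dbt seed`: create the table, then insert every row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"create table {table} ({cols})")
    n = 0
    for row in reader:  # one read + one insert per row
        conn.execute(f"insert into {table} ({cols}) values ({placeholders})", row)
        n += 1
    return n

conn = sqlite3.connect(":memory:")
rows = load_seed(conn, "my_reference_data", SEED_CSV)
print(rows)  # 3
```

The key point is the loop: every row in the file produces one unit of work, which is what drives the analysis that follows.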
This setup loads a static CSV file into the warehouse as a table (via the `dbt seed` command) and references it in a model. Consider what happens when dbt loads the seed data:
- Primary operation: Reading each row from the CSV file and inserting it into the database table.
- How many times: Once per row in the seed file.
As the number of rows in the seed file increases, the operations increase proportionally.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 row reads and inserts |
| 100 | 100 row reads and inserts |
| 1000 | 1000 row reads and inserts |
Pattern observation: The work grows directly with the number of rows; doubling rows doubles the work.
Time Complexity: O(n)
This means the time to load seeds grows linearly with the number of rows in the seed file.
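To make the linear pattern concrete, this sketch counts per-row insert operations for growing seed sizes (again using SQLite as a stand-in; the table name and synthetic data are illustrative):

```python
import sqlite3

def seed_rows(conn: sqlite3.Connection, table: str, n: int) -> int:
    """Insert n synthetic reference rows one at a time, counting operations."""
    conn.execute(f"create table {table} (id integer, name text)")
    ops = 0
    for i in range(n):
        conn.execute(f"insert into {table} values (?, ?)", (i, f"Category {i}"))
        ops += 1
    return ops

counts = {}
for n in (10, 100, 1000):
    conn = sqlite3.connect(":memory:")
    counts[n] = seed_rows(conn, "ref_data", n)

print(counts)  # {10: 10, 100: 100, 1000: 1000} -- work grows 1:1 with rows
```

The operation counts match the table above exactly: n rows cost n inserts, the signature of O(n).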
[X] Wrong: "Loading seeds is instant no matter how big the file is."
[OK] Correct: Each row must be read and inserted, so bigger files take more time.
Understanding how seed loading scales helps you explain data pipeline performance clearly and shows that you grasp practical data engineering concepts.
What if we compressed the seed file and loaded it directly? How might that affect the time complexity?
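One hedged way to reason about this: compression shrinks the bytes on disk, but every row still has to be decompressed, parsed, and inserted, so the row-driven work stays O(n); only the constant factor for file I/O changes. A sketch with Python's `gzip` module (synthetic data, purely illustrative):

```python
import csv
import gzip
import io

# Build a gzipped version of a synthetic 1000-row seed file.
raw = "id,name\n" + "".join(f"{i},Category {i}\n" for i in range(1000))
compressed = gzip.compress(raw.encode("utf-8"))

# Loading still visits every row: decompress the stream, then parse line by line.
with gzip.open(io.BytesIO(compressed), mode="rt") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    rows_touched = sum(1 for _ in reader)

print(len(raw), len(compressed), rows_touched)
# The compressed file is much smaller, yet we still did one parse per row.
```

Fewer bytes read can make loading faster in wall-clock terms, but the asymptotic complexity remains linear in the number of rows.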