Seeds for static reference data in dbt - Time & Space Complexity
We want to understand how the time to load static reference data using seeds in dbt changes as the data size grows.
How does the number of rows in the seed file affect the loading time?
Analyze the time complexity of this dbt seed loading snippet.
-- seeds/my_reference_data.csv
id,name
1,Category A
2,Category B
3,Category C
# dbt_project.yml (optional: dbt discovers CSV files in seeds/ automatically)
seeds:
  my_project:
    my_reference_data:
      +column_types:
        id: integer
-- Usage in model
select * from {{ ref('my_reference_data') }}
This code loads a static CSV file as a seed table and references it in a model.
Look at what happens when dbt loads the seed data.
- Primary operation: Reading each row from the CSV file and inserting it into the database table.
- How many times: Once per row in the seed file.
As the number of rows in the seed file increases, the operations increase proportionally.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 row reads and inserts |
| 100 | 100 row reads and inserts |
| 1000 | 1000 row reads and inserts |
Pattern observation: The work grows directly with the number of rows; doubling rows doubles the work.
Time Complexity: O(n)
This means the time to load seeds grows linearly with the number of rows in the seed file.
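The linear growth can be checked with a small simulation. This is only a sketch under stated assumptions: Python's built-in `sqlite3` stands in for the warehouse, `load_seed` and `make_csv` are hypothetical helpers, and the loader inserts one row per statement. Real dbt adapters typically batch their inserts, which changes the constant factor but not the fact that every row is still read and written once.

```python
import csv
import io
import sqlite3


def load_seed(csv_text: str, conn: sqlite3.Connection, table: str) -> int:
    """Load CSV text into a table, one insert per data row; return the insert count."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first line holds the column names
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    inserts = 0
    for row in reader:  # runs once per data row, so the work is O(n)
        conn.execute(f'INSERT INTO "{table}" VALUES ({placeholders})', row)
        inserts += 1
    conn.commit()
    return inserts


def make_csv(n: int) -> str:
    """Build a hypothetical seed file with n data rows."""
    lines = ["id,name"] + [f"{i},Category {i}" for i in range(1, n + 1)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    for n in (10, 100, 1000):
        # Prints "10 10", "100 100", "1000 1000": inserts match the row count.
        print(n, load_seed(make_csv(n), conn, "my_reference_data"))
```

The insert count matches the row count at every size, which is the same pattern the table above describes.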
[X] Wrong: "Loading seeds is instant no matter how big the file is."
[OK] Correct: Each row must be read and inserted, so bigger files take more time.
Understanding how seed loading scales helps you explain data pipeline performance clearly and shows you grasp practical data engineering concepts.
What if we compressed the seed file and loaded it directly? How might that affect the time complexity?
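One way to reason about that question is to try it outside dbt. The sketch below is a thought experiment, not a description of dbt behavior: it does not assume dbt accepts compressed seed files, and `make_gzipped_csv` and `count_rows_gzipped` are hypothetical helpers. It suggests that compression shrinks the bytes on disk, while decompressing and parsing still touch every row.

```python
import csv
import gzip
import io


def make_gzipped_csv(n: int) -> bytes:
    """Build a hypothetical seed file with n data rows and gzip it."""
    text = "id,name\n" + "".join(f"{i},Category {i}\n" for i in range(1, n + 1))
    return gzip.compress(text.encode("utf-8"))


def count_rows_gzipped(payload: bytes) -> int:
    """Decompress a gzipped CSV and count its data rows."""
    with gzip.open(io.BytesIO(payload), mode="rt", newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header line
        return sum(1 for _ in reader)  # still one step per row: O(n)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        payload = make_gzipped_csv(n)
        print(n, len(payload), count_rows_gzipped(payload))
```

Under these assumptions the time complexity stays O(n): fewer bytes are transferred, but the number of rows to parse and insert is unchanged.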
Practice
What is the main purpose of seeds in dbt?
Solution
Step 1: Understand what seeds are in dbt
Seeds are CSV files that contain static reference data you want to load into your database.
Step 2: Identify the main use of seeds
Seeds let you easily add fixed data tables without writing SQL queries.
Final Answer:
To load static reference data from CSV files into your database -> Option B
Quick Check:
Seeds = static CSV data load [OK]
Common mistakes:
- Confusing seeds with models that run SQL
- Thinking seeds schedule dbt runs
- Assuming seeds are for dynamic data
Solution
Step 1: Recall dbt commands related to seeds
The command `dbt seed` loads CSV seed files into the database as tables.
Step 2: Differentiate from other commands
`dbt run` runs models, `dbt test` runs tests, and `dbt compile` compiles SQL but does not load seeds.
Final Answer:
`dbt seed` -> Option C
Quick Check:
Load seeds = dbt seed [OK]
Common mistakes:
- Using 'dbt run' to load seeds
- Confusing 'dbt test' with loading data
- Thinking 'dbt compile' loads data
Given a seed file countries.csv with columns id and name, what will be the output of this dbt model SQL?
select * from {{ ref('countries') }}
Solution
Step 1: Understand how seeds are referenced in dbt
Seeds become tables in the database and can be referenced using `ref()`, just like models.
Step 2: Predict the query output
The query selects all columns and rows from the seed table `countries`, so it returns the full CSV data.
Final Answer:
A table with all rows and columns from countries.csv -> Option A
Quick Check:
ref(seed) = full seed table [OK]
Common mistakes:
- Thinking seeds cannot be referenced
- Assuming seeds load empty tables
- Expecting partial columns only
You ran `dbt seed` but your seed table did not update. Which of these is the most likely cause?
Solution
Step 1: Check seed discovery mechanism
When you run `dbt seed`, dbt automatically discovers and loads the CSV files in the `seeds/` folder.
Step 2: Identify why the table doesn't update
If the CSV file is missing from the `seeds/` folder, `dbt seed` still completes successfully, but it never sees that seed, so the table is left unchanged.
Final Answer:
You forgot to add the seed CSV file in the seeds folder -> Option A
Quick Check:
Seeds folder missing CSV = no update [OK]
Common mistakes:
- Thinking seeds require config in dbt_project.yml
- Confusing dbt run with dbt seed
- Blaming CSV syntax errors (these would cause an explicit failure)
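The discovery step in this solution can be pictured as a folder scan. The sketch below is a simplification, not dbt's actual implementation: `discover_seeds` is a hypothetical helper, and the folder and file contents are made-up illustration data. It shows why a CSV that was never saved into `seeds/` produces no error: it is simply never found.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def discover_seeds(seed_dir: Path) -> list[str]:
    """Return the seed names (CSV file stems) found under seed_dir, sorted."""
    return sorted(path.stem for path in seed_dir.rglob("*.csv"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        seed_dir = Path(tmp) / "seeds"
        seed_dir.mkdir()
        (seed_dir / "countries.csv").write_text("id,name\n1,France\n")
        # currencies.csv was never saved into seeds/, so it is not discovered
        # and no table would be created or refreshed for it.
        print(discover_seeds(seed_dir))  # ['countries']
```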
You want to use a seed currencies.csv with columns code and symbol inside a model to join with a transactions table on currency_code. Which is the correct way to write the join in your model SQL?
Solution
Step 1: Recall how to reference seeds in dbt models
Seeds are referenced using `{{ ref('seed_name') }}` to get the table name in SQL.
Step 2: Identify the correct join syntax
Joining `transactions` with `{{ ref('currencies') }}` correctly uses the seed table in the join.
Final Answer:
select t.*, c.symbol from transactions t join {{ ref('currencies') }} c on t.currency_code = c.code -> Option D
Quick Check:
Join seed with ref() = correct [OK]
Common mistakes:
- Using raw CSV filename in SQL
- Forgetting to use ref() for seeds
- Trying to use a non-existent seed() function
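To see the join from the final answer produce rows, it can be run against a throwaway database. This is a rough sketch with several assumptions: `sqlite3` stands in for the warehouse, `render_refs` is a hypothetical and heavily simplified stand-in for dbt's compilation step (real dbt resolves `ref()` to a fully qualified relation, not a bare name), and the sample rows are invented.

```python
import re
import sqlite3

# Matches {{ ref('name') }} and captures the seed or model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")

MODEL_SQL = (
    "select t.*, c.symbol from transactions t "
    "join {{ ref('currencies') }} c on t.currency_code = c.code"
)


def render_refs(sql: str) -> str:
    """Replace each {{ ref('name') }} with the bare table name (simplified)."""
    return REF_PATTERN.sub(lambda match: match.group(1), sql)


def build_demo_db() -> sqlite3.Connection:
    """Create the seed table and a transactions table with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table currencies (code text, symbol text)")
    conn.executemany(
        "insert into currencies values (?, ?)", [("USD", "$"), ("CAD", "C$")]
    )
    conn.execute(
        "create table transactions (id integer, amount real, currency_code text)"
    )
    conn.executemany(
        "insert into transactions values (?, ?, ?)",
        [(1, 9.5, "USD"), (2, 20.0, "CAD")],
    )
    return conn


def run_model(conn: sqlite3.Connection) -> list:
    """Render the model SQL and return the joined rows."""
    return conn.execute(render_refs(MODEL_SQL)).fetchall()


if __name__ == "__main__":
    print(render_refs(MODEL_SQL))
    print(sorted(run_model(build_demo_db())))
```

Each transaction row comes back with its matching `symbol` column appended, which is what the join on `currency_code = code` is meant to achieve.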
