# Loading CSV Seeds in dbt: Time Complexity
When dbt loads a CSV seed, how does the loading time change as the file grows larger? Specifically: how does load time increase with the number of rows? Let's analyze the time complexity of the seed-loading setup below.
```yaml
# dbt seed configuration example (in dbt_project.yml)
seeds:
  my_project:
    my_seed:
      +quote_columns: false
      +delimiter: ','   # custom delimiters require dbt-core 1.5+
```

```shell
# dbt command to load the seed
dbt seed --select my_seed
```
This configuration tells dbt how to load the CSV seed file into a table in the data warehouse. Loading a CSV seed involves reading each row and inserting it into the database.
- Primary operation: Reading and inserting each row from the CSV file.
- How many times: Once per row in the CSV file.
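To make the "one operation per row" idea concrete, here is a toy sketch of what loading a seed amounts to. This is not dbt's actual implementation; it just illustrates the row-by-row read-and-insert pattern using Python's `csv` module and an in-memory SQLite database, with a made-up table name `my_seed`.

```python
# Hypothetical sketch of a seed load: read each CSV row, insert it.
# NOT dbt's real implementation -- just an illustration of the pattern.
import csv
import io
import sqlite3

csv_text = "id,name\n1,alice\n2,bob\n3,carol\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_seed (id INTEGER, name TEXT)")

reader = csv.DictReader(io.StringIO(csv_text))
rows_inserted = 0
for row in reader:  # one pass over the file: one read + one insert per row
    conn.execute("INSERT INTO my_seed VALUES (?, ?)", (row["id"], row["name"]))
    rows_inserted += 1

print(rows_inserted)  # → 3, one insert per data row
```

Because the loop body runs exactly once per data row, the total work tracks the row count directly.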
As the number of rows in the CSV file increases, the time to load grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 row reads and inserts |
| 100 | About 100 row reads and inserts |
| 1000 | About 1000 row reads and inserts |
Pattern observation: Doubling the rows roughly doubles the work and time needed.
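The doubling pattern can be checked with a tiny operation counter that models the load as one read-plus-insert per row (a simplification; real loads also pay per-batch and network overhead):

```python
# Toy model: one read + insert per row. Doubling the rows doubles the ops.
def operations_to_load(n_rows: int) -> int:
    ops = 0
    for _ in range(n_rows):  # one read + insert per row
        ops += 1
    return ops

for n in (10, 100, 1000):
    print(n, operations_to_load(n))  # matches the table above

# Linear growth: twice the rows, twice the work.
assert operations_to_load(200) == 2 * operations_to_load(100)
```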
Time Complexity: O(n)
This means the loading time grows linearly with the number of rows in the CSV file.
[X] Wrong: "Loading a CSV seed is instant no matter the size."
[OK] Correct: Each row must be read and inserted, so bigger files take more time.
Understanding how data loading scales helps you explain performance in real projects and shows you think about efficiency.
"What if the CSV file had multiple columns with complex data types? How would that affect the time complexity?"
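One hedged way to reason about that question: if each of the n rows has m columns, parsing and converting touches every cell, so the work grows roughly like n × m. For a fixed column count m, this is still linear in the number of rows. A sketch of that counting argument:

```python
# Sketch: with m columns, each row costs roughly m cell-parses,
# so total work is about n * m. For fixed m, still O(n) in rows.
def cell_operations(n_rows: int, m_cols: int) -> int:
    ops = 0
    for _ in range(n_rows):
        for _ in range(m_cols):  # parse/convert each cell
            ops += 1
    return ops

print(cell_operations(1000, 5))  # → 5000 cell-level operations
```

Complex data types (e.g. long strings or values needing type coercion) raise the constant cost per cell, but they do not change the linear shape of the growth in rows.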