Configuring sources in YAML in dbt - Performance & Efficiency
When we configure sources in YAML for dbt, we define where our data comes from.
We want to understand how the time to process these configurations grows as we add more sources or tables.
Analyze the time complexity of this YAML source configuration snippet.
sources:
  - name: sales_db
    tables:
      - name: customers
      - name: orders
      - name: products
This snippet defines one source with three tables listed under it.
Look at what repeats when dbt reads this YAML configuration.
- Primary operation: Reading each table entry under a source.
- How many times: Once for each table listed in the source.
As you add more tables to a source, dbt reads each one in turn.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 tables | 10 reads |
| 100 tables | 100 reads |
| 1000 tables | 1000 reads |
Pattern observation: The work grows directly with the number of tables.
Time Complexity: O(n)
This means the time to process source configurations grows linearly with the number of tables.
[X] Wrong: "Adding more tables won't affect processing time much because YAML is just text."
[OK] Correct: Even though YAML is text, dbt must read and process each table entry, so more tables mean more work.
Understanding how configuration size affects processing helps you explain efficiency in real projects.
What if we added multiple sources each with many tables? How would the time complexity change?
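To make the question above concrete, here is a minimal sketch of a file with two sources. The source and table names are hypothetical. Under the same counting model used earlier (one read per table entry), the work would be the sum of the table counts across all sources, so with s sources of t tables each it would be roughly s × t reads, which is still linear in the total number of table entries.

```yaml
# Hypothetical project with two sources; all names are illustrative.
sources:
  - name: sales_db          # source 1: 3 table entries
    tables:
      - name: customers
      - name: orders
      - name: products
  - name: marketing_db      # source 2: 2 table entries
    tables:
      - name: leads
      - name: campaigns
# Under the counting model above: 3 + 2 = 5 table reads in total.
```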
Practice
Solution
Step 1: Understand the role of source configuration
Source configuration in dbt YAML files defines where raw data tables are located in the database.
Step 2: Differentiate from other dbt tasks
Writing SQL queries and scheduling runs are done elsewhere, not in source YAML files.
Final Answer:
To tell dbt where to find raw data tables -> Option B
Quick Check:
Source config = raw data location [OK]
- Confusing source config with SQL model code
- Thinking sources schedule runs
- Assuming sources create visualizations
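As an illustration of "raw data location", a source entry can carry `database` and `schema` keys that point dbt at the place where the tables live. The sketch below assumes a database called `raw` and a schema called `sales`; both names are hypothetical and would differ in a real warehouse.

```yaml
sources:
  - name: sales_db
    database: raw        # hypothetical database name
    schema: sales        # hypothetical schema name
    tables:
      - name: customers  # dbt only records the location; it does not create this table
```

If `schema` is omitted, dbt falls back to the source `name` as the schema.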
Solution
Step 1: Recall correct YAML source structure
The correct syntax uses 'sources' as a list with 'name' and a nested 'tables' list, each entry with a 'name'.
Step 2: Compare options to syntax
The following option matches the correct indentation and keys exactly:
sources:
  - name: raw_data
    tables:
      - name: customers
Final Answer:
The snippet shown in Step 2 -> Option C
Quick Check:
Correct YAML keys and indentation = 'sources' list, then 'name', then nested 'tables' list with 'name' [OK]
- Using singular 'source' instead of 'sources'
- Missing 'name' key for tables
- Incorrect indentation breaking YAML structure
sources:
  - name: sales_data
    tables:
      - name: transactions
        loaded_at_field: transaction_date
Solution
Step 1: Locate the 'loaded_at_field' key in YAML
It is nested under the 'transactions' table inside the 'sales_data' source.
Step 2: Identify the value assigned
The value assigned to 'loaded_at_field' is 'transaction_date'.
Final Answer:
transaction_date -> Option A
Quick Check:
loaded_at_field value = transaction_date [OK]
- Confusing source name with field value
- Picking table name instead of field value
- Misreading YAML indentation levels
sources:
  - name: marketing_data
    tables:
      - name: leads
        freshness:
          warn_after:
            count: 12
            period: hours
          error_after:
            count: 1
            period: days
Solution
Step 1: Understand dbt freshness period syntax
dbt freshness requires a singular 'period' value: 'minute', 'hour', or 'day'. Plural forms ('hours', 'days') are invalid and cause errors.
Step 2: Check the YAML periods
'period: hours' and 'period: days' use the plural form, which dbt does not recognize.
Step 3: Rule out other options
A: The counts are logical (warn after 12 hours, error after 1 day, i.e. 24 hours). B: The indentation is correct. C: Incorrect, because the error_after time should be *longer* than the warn_after time, not shorter.
Final Answer:
The 'period' values must be singular strings -> Option D
Quick Check:
period: hour/day (singular only) [OK]
- Using plural periods ('hours', 'days')
- Incorrect YAML indentation
- Thinking error_after time should be shorter than warn_after
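For comparison, here is a sketch of the same freshness block with the periods written in the singular form. Only the two `period` values differ from the question's snippet; everything else is kept as given.

```yaml
sources:
  - name: marketing_data
    tables:
      - name: leads
        freshness:
          warn_after:
            count: 12
            period: hour   # singular, not 'hours'
          error_after:
            count: 1
            period: day    # singular, not 'days'
```

Freshness checks are run with `dbt source freshness`.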
Solution
Step 1: Recall correct test syntax in source YAML
Tests are added under 'columns' with 'name' and a 'tests' list containing test names.
Step 2: Check each option's structure
This option correctly uses a 'columns' list with 'name' and a 'tests' list containing 'not_null':
sources:
  - name: app_data
    tables:
      - name: users
        columns:
          - name: email
            tests:
              - not_null
Step 3: Identify errors in other options
- 'sources: - name: app_data tables: - name: users tests: - column: email test: not_null' uses the wrong keys.
- 'sources: - name: app_data tables: - users: columns: - email: tests: - not_null' has the wrong nesting.
- 'sources: - name: app_data tables: - name: users columns: - email test: not_null' uses 'test' instead of 'tests'.
Final Answer:
The snippet shown in Step 2 -> Option A
Quick Check:
Tests under columns with a 'tests' list [OK]
- Using 'test' instead of 'tests'
- Wrong nesting of columns and tests
- Misnaming keys like 'column' instead of 'name'
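A slightly fuller sketch of the same structure is shown below, with two columns and more than one test on a column. The `id` column is hypothetical and added only for illustration; `unique` and `not_null` are built-in dbt generic tests. dbt 1.8 and later also accept `data_tests` as the key name; the sketch keeps `tests` to match the text above.

```yaml
sources:
  - name: app_data
    tables:
      - name: users
        columns:
          - name: id            # hypothetical column, for illustration
            tests:
              - unique
              - not_null
          - name: email
            tests:              # plural 'tests', holding a list of test names
              - not_null
```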
