Configuring sources in YAML in dbt - Mechanics & Internals

Overview - Configuring sources in YAML
What is it?
Configuring sources in YAML means declaring, in a plain-text YAML file, where your raw data lives. In dbt, sources describe tables that already exist in your data warehouse, loaded by some process outside dbt. Declaring them lets dbt track and manage your data dependencies explicitly. It’s like giving dbt a map to find your data before transforming it.
Why it matters
Without configured sources, models have to hardcode raw table names, and dbt cannot track where the original data comes from. That makes transformations brittle and hard to maintain. By defining sources, you create a clear, reusable, and documented connection to your raw data, which improves data quality and team collaboration.
Where it fits
Before learning this, you should understand basic YAML syntax and dbt project structure. After mastering source configuration, you can move on to writing models that transform data and testing data quality using dbt.
Mental Model
Core Idea
Configuring sources in YAML is like giving dbt a clear address book to find and trust your raw data before transforming it.
Think of it like...
Imagine you want to bake a cake. Configuring sources is like writing down the exact grocery store and aisle where you get your ingredients. Without this, you might waste time searching or use wrong ingredients.
┌───────────────┐
│  dbt Project  │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│  sources.yml  │
│ - name: raw_data_source
│   tables:
│    - name: customers
│    - name: orders
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Raw Data Tables│
│ customers     │
│ orders        │
└───────────────┘
Build-Up - 7 Steps
1. Foundation: Understanding YAML Basics
Concept: Learn the simple structure and syntax of YAML used to write configuration files.
YAML is a human-friendly way to write data. It uses indentation to show hierarchy. For example:
sources:
  - name: my_source
    tables:
      - name: my_table
This means 'sources' is a list with one item named 'my_source', which has a list of tables including 'my_table'.
Result
You can read and write YAML files that dbt uses to configure sources.
Understanding YAML is essential because dbt uses it to define sources clearly and simply.
2. Foundation: What Are Sources in dbt?
Concept: Sources tell dbt about external raw data tables before any transformation.
In dbt, a source is a reference to a table that lives in your warehouse but is not built by your dbt models. You define sources in YAML files under the 'sources' key. Each source has a name and a list of tables it contains. This helps dbt know where to find raw data.
Result
You can create a YAML file that lists your raw data tables as sources.
Knowing sources separates raw data from transformed data, making your project organized and reliable.
3. Intermediate: Writing a Source Configuration File
Before reading on: do you think a source config needs only table names, or also database and schema info? Commit to your answer.
Concept: Source configuration includes database, schema, and table names to fully locate data.
A typical source config looks like this:
sources:
  - name: raw_data
    database: analytics_db
    schema: public
    tables:
      - name: customers
      - name: orders
This tells dbt the exact place of each table.
Result
dbt knows exactly where to find each source table in your database.
Including database and schema ensures dbt connects to the right place, avoiding confusion in complex environments.
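As a sketch of how these location keys combine: to my understanding, `database` and `schema` are optional on a source. When `schema` is omitted, dbt uses the source name as the schema, and when `database` is omitted, it uses the target database from your profile. The second source below, `jaffle_raw`, and its `payments` table are hypothetical names used only for illustration.

```yaml
sources:
  # Fully qualified: resolves to analytics_db.public.customers
  - name: raw_data
    database: analytics_db
    schema: public
    tables:
      - name: customers
      - name: orders

  # Hypothetical source with no database/schema keys:
  # the schema falls back to the source name (jaffle_raw),
  # the database falls back to the target database in profiles.yml.
  - name: jaffle_raw
    tables:
      - name: payments
```

Spelling out both keys, as in the first source, is the safer choice when raw data lives in a different database from the one your models are built in.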
4. Intermediate: Referencing Sources in Models
Before reading on: do you think you write raw SQL table names directly or use a special function to refer to sources in dbt models? Commit to your answer.
Concept: dbt uses the source() function to refer to configured sources in SQL models.
Instead of writing raw table names, you use source() like this:
select * from {{ source('raw_data', 'customers') }}
This tells dbt to use the 'customers' table from the 'raw_data' source defined in YAML.
Result
Your models dynamically link to the correct source tables, improving maintainability.
Using source() creates a clear connection between your models and source configs, making refactoring safer.
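5. Intermediate: Adding Freshness and Description Metadata
Concept: You can add metadata like freshness checks and descriptions to sources for better data quality and documentation.
In your YAML, add freshness and description:
sources:
  - name: raw_data
    description: 'Raw data from production'
    tables:
      - name: customers
        description: 'Customer details'
        freshness:
          warn_after:
            count: 12
            period: hour
          error_after:
            count: 24
            period: hour
This helps dbt check if data is updated on time and documents it. Note that dbt also needs a way to tell when rows were loaded, typically a loaded_at_field timestamp column declared on the source or table; without one, freshness for that table is generally skipped unless the adapter can read load times from warehouse metadata.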
Result
dbt can warn or error if source data is stale, and your team understands data purpose better.
Metadata improves trust and communication about your data sources.
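A minimal sketch of a fuller freshness setup, assuming a timestamp column named `_loaded_at` exists on the raw tables (that column name is an assumption, not something the examples above define). Freshness declared on the source is inherited by its tables, and a single table can opt out with `freshness: null`.

```yaml
sources:
  - name: raw_data
    description: 'Raw data from production'
    # Assumed timestamp column recording when each row was loaded
    loaded_at_field: _loaded_at
    # Source-level freshness applies to every table below
    freshness:
      warn_after:
        count: 12
        period: hour
      error_after:
        count: 24
        period: hour
    tables:
      - name: customers
        description: 'Customer details'
      - name: orders
        # Opt this table out of freshness checks
        freshness: null
```

These thresholds are evaluated when you run dbt source freshness. Depending on your dbt version, some of these keys may also be accepted under a `config:` block.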
6. Advanced: Using Source Configs for Testing and Documentation
Before reading on: do you think source configs only help find data, or can they also be used for tests and docs? Commit to your answer.
Concept: Source configurations enable automated tests and documentation generation in dbt.
You can define tests on source columns, such as uniqueness or not null. Because these tests check a specific column, they are declared under 'columns' (the column name customer_id here is illustrative):
sources:
  - name: raw_data
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
Running dbt test checks these automatically. Also, dbt docs generate builds documentation pages from source descriptions.
Result
Your raw data is automatically checked for quality and well documented.
Leveraging source configs for tests and docs raises data reliability and team confidence.
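Beyond unique and not_null, a source column can also carry a relationships test. The sketch below assumes both raw tables have a `customer_id` column (an illustrative name) and checks that every order points at an existing customer. The exact key layout for test arguments may differ slightly between dbt versions.

```yaml
sources:
  - name: raw_data
    tables:
      - name: customers
        columns:
          - name: customer_id   # assumed column name
            description: 'Primary key of the customers table'
            tests:
              - unique
              - not_null
      - name: orders
        columns:
          - name: customer_id   # assumed column name
            tests:
              - not_null
              # Every orders.customer_id must exist in customers.customer_id
              - relationships:
                  to: source('raw_data', 'customers')
                  field: customer_id
```

To run only the tests attached to this source, a selector such as dbt test --select "source:raw_data" can be used.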
7. Expert: Managing Multiple Environments with Source Overrides
Before reading on: do you think source configs are fixed, or can they change per environment? Commit to your answer.
Concept: dbt allows overriding source configurations per environment to handle different databases or schemas.
You can use 'vars' or 'target' in dbt to change source configs dynamically:
# in dbt_project.yml
vars:
  source_schema: 'dev_schema'
# in sources.yml
sources:
  - name: raw_data
    schema: "{{ var('source_schema') }}"
The outer quotes must be double quotes so that the single quotes inside the Jinja expression remain valid YAML. This lets you run the same project in dev, test, or prod with different source locations.
Result
Your dbt project adapts to multiple environments without changing source YAML files manually.
Dynamic source configs enable smooth deployment workflows and reduce errors across environments.
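Another sketch of the same idea, using the built-in `target` and `env_var()` context instead of a project variable. The database name `analytics_dev` and the environment variable `DBT_SOURCE_SCHEMA` are hypothetical, and the example assumes your production target is named `prod`.

```yaml
sources:
  - name: raw_data
    # Pick the database from the active target (assumes a target named 'prod')
    database: "{{ 'analytics_db' if target.name == 'prod' else 'analytics_dev' }}"
    # Read the schema from a hypothetical environment variable, defaulting to 'public'
    schema: "{{ env_var('DBT_SOURCE_SCHEMA', 'public') }}"
    tables:
      - name: customers
      - name: orders
```

Keeping the switch inside the YAML means models keep calling source('raw_data', 'customers') unchanged in every environment. A variable-based setup can also be overridden on the command line, for example with dbt run --vars '{source_schema: prod_schema}'.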
Under the Hood
When dbt parses a project, it reads the YAML source definitions and builds an internal catalog of raw data tables with their full database and schema paths. The source() function in models uses this catalog to generate correct SQL references and to record each model's dependency on its sources. Tests defined on sources run with dbt test (or dbt build), and freshness checks run with dbt source freshness; neither runs as part of a plain dbt run. This separation allows dbt to manage dependencies and documentation systematically.
Why designed this way?
dbt was designed to separate raw data definitions from transformations to improve clarity and maintainability. Using YAML for sources provides a simple, readable format that is easy to version control. The design supports modularity, allowing teams to update source info without touching transformation code, and supports automation of tests and docs.
┌───────────────┐
│ sources.yml   │
│ (YAML config) │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ dbt Catalog   │
│ (source info) │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ SQL Models    │
│ source() refs │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Database      │
│ Raw Tables    │
└───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Do you think source configurations automatically create tables in your database? Commit to yes or no.
Common Belief: Configuring sources in YAML creates the actual tables in the database.
Reality: Source configs only tell dbt where existing tables are; they do not create or modify tables.
Why it matters: Thinking sources create tables can lead to confusion and errors when tables are missing or not updated.
Quick: Do you think you can reference a source table in dbt without defining it in YAML? Commit to yes or no.
Common Belief: You can use any table name directly in models without defining it as a source in YAML.
Reality: Hardcoded table names do work as plain SQL, but to use the source() function and get benefits like lineage, testing, and docs, the source must be defined in YAML first.
Why it matters: Skipping source definitions loses automation benefits and can cause broken references.
Quick: Do you think source freshness checks run automatically every time you run dbt models? Commit to yes or no.
Common Belief: Source freshness checks run automatically with every dbt run.
Reality: Freshness checks run only when you explicitly run 'dbt source freshness'.
Why it matters: Assuming automatic freshness checks can cause unnoticed stale data issues.
Quick: Do you think source configurations are fixed and cannot be changed per environment? Commit to yes or no.
Common Belief: Source YAML files are static and cannot adapt to different environments like dev or prod.
Reality: Source configs can use variables and Jinja templating to change per environment dynamically.
Why it matters: Not knowing this limits flexibility and causes manual errors when deploying across environments.
Expert Zone
1. Source configurations can include quoting rules to handle case sensitivity or reserved words in different databases.
2. You can define multiple sources in one YAML file or split them across files for better organization in large projects.
3. Using source freshness metadata effectively requires understanding your data update patterns and scheduling dbt runs accordingly.
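As an illustration of the first point, here is a hedged sketch of quoting rules combined with an `identifier` override. The physical table name `Customers_Export` is invented for the example, and how a quoted identifier is rendered in SQL depends on your warehouse adapter.

```yaml
sources:
  - name: raw_data
    database: analytics_db
    schema: public
    # Quote only the table identifier, not the database or schema
    quoting:
      database: false
      schema: false
      identifier: true
    tables:
      # Models still write source('raw_data', 'customers');
      # dbt substitutes the physical name given by identifier.
      - name: customers
        identifier: Customers_Export   # hypothetical mixed-case table name
      - name: orders
```

Separating the logical name from the physical identifier keeps model code stable even if the raw table is renamed upstream.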
When NOT to use
If your data source is not a table but an API or streaming data, YAML source configs in dbt are not suitable. Instead, use external tools or custom integrations. Also, for very simple projects with no raw data dependencies, explicit source configs may be unnecessary.
Production Patterns
In production, teams use source configs to enforce data contracts, run automated tests on raw data, and generate documentation for data consumers. They also use environment variables and variables in YAML to manage multiple deployment targets seamlessly.
Connections
Data Cataloging
Configuring sources in YAML builds a simple form of data cataloging by documenting raw data locations and metadata.
Understanding source configs helps grasp how data catalogs organize and document data assets for discovery and governance.
Infrastructure as Code (IaC)
Both use declarative configuration files (like YAML) to define infrastructure or data setups.
Knowing source YAML configs in dbt parallels IaC concepts, showing how declarative files improve reproducibility and version control.
Supply Chain Management
Source configuration is like managing suppliers and raw materials before manufacturing products.
Seeing data sources as suppliers clarifies the importance of tracking and validating inputs before processing.
Common Pitfalls
#1 Forgetting to define a source in YAML but trying to use source() in models.
Wrong approach: select * from {{ source('raw_data', 'missing_table') }}
Correct approach: Define 'missing_table' under 'raw_data' in sources.yml before referencing it.
Root cause: Assuming source() works without a prior YAML definition. dbt cannot resolve the reference and fails when it compiles the model.
#2 Hardcoding database and schema names in SQL models instead of using source configs.
Wrong approach: select * from analytics_db.public.customers
Correct approach: select * from {{ source('raw_data', 'customers') }}
Root cause: Hardcoding breaks portability and maintainability across environments.
#3 Misindenting YAML, causing dbt to fail parsing source configs.
Wrong approach (for example, 'tables' indented deeper than 'name'):
sources:
  - name: raw_data
     tables:
      - name: customers
Correct approach:
sources:
  - name: raw_data
    tables:
      - name: customers
Root cause: YAML is sensitive to indentation; incorrect spacing breaks structure.
Key Takeaways
Configuring sources in YAML tells dbt where raw data lives, separating raw inputs from transformations.
Using source() in models links your SQL to these configurations, improving clarity and maintainability.
Adding metadata like freshness and tests in source configs boosts data quality and documentation.
Source configs can be dynamic per environment, enabling smooth deployments across dev, test, and prod.
Understanding source configuration is foundational for building reliable, scalable dbt projects.

Practice

1. What is the main purpose of configuring sources in a dbt YAML file?
easy
A. To write SQL queries for data transformation
B. To tell dbt where to find raw data tables
C. To create dashboards for data visualization
D. To schedule dbt runs automatically

Solution

  1. Step 1: Understand the role of source configuration

    Source configuration in dbt YAML files defines where raw data tables are located in the database.
  2. Step 2: Differentiate from other dbt tasks

    Writing SQL queries and scheduling runs are done elsewhere, not in source YAML files.
  3. Final Answer:

    To tell dbt where to find raw data tables -> Option B
  4. Quick Check:

    Source config = raw data location [OK]
Hint: Sources define raw table locations in YAML [OK]
Common Mistakes:
  • Confusing source config with SQL model code
  • Thinking sources schedule runs
  • Assuming sources create visualizations
2. Which of the following is the correct syntax to define a source in a dbt YAML file?
easy
A. source: name: raw_data table: - customers
B. sources: name: raw_data tables: - customers
C. sources: - name: raw_data tables: - name: customers
D. source: - raw_data: tables: - customers

Solution

  1. Step 1: Recall correct YAML source structure

    The correct syntax uses 'sources' as a list with 'name' and nested 'tables' list, each with a 'name'.
  2. Step 2: Compare options to syntax

    Option C matches the correct keys and nesting exactly: a 'sources' list whose item has a 'name' and a 'tables' list, with each table given by its own 'name'.
  3. Final Answer:

    sources: - name: raw_data tables: - name: customers -> Option C
  4. Quick Check:

    Correct YAML keys and indentation = Option C [OK]
Hint: Look for 'sources' list with 'name' and 'tables' keys [OK]
Common Mistakes:
  • Using singular 'source' instead of 'sources'
  • Missing 'name' key for tables
  • Incorrect indentation breaking YAML structure
3. Given this YAML snippet, what is the value of the 'loaded_at_field' for the source 'sales_data'?
sources:
  - name: sales_data
    tables:
      - name: transactions
        loaded_at_field: transaction_date
medium
A. transaction_date
B. transactions
C. loaded_at_field
D. sales_data

Solution

  1. Step 1: Locate the 'loaded_at_field' key in YAML

    It is nested under the 'transactions' table inside the 'sales_data' source.
  2. Step 2: Identify the value assigned

    The value assigned to 'loaded_at_field' is 'transaction_date'.
  3. Final Answer:

    transaction_date -> Option A
  4. Quick Check:

    loaded_at_field value = transaction_date [OK]
Hint: Find 'loaded_at_field' key's value under table [OK]
Common Mistakes:
  • Confusing source name with field value
  • Picking table name instead of field value
  • Misreading YAML indentation levels
4. Identify the error in this source configuration YAML:
sources:
  - name: marketing_data
    tables:
      - name: leads
        freshness:
          warn_after:
            count: 12
            period: hours
          error_after:
            count: 1
            period: days
medium
A. 'warn_after' and 'error_after' counts are reversed
B. The indentation under 'freshness' is incorrect
C. The 'error_after' period should be less than 'warn_after'
D. The 'period' values must be singular strings

Solution

  1. Step 1: Understand dbt freshness period syntax

    dbt freshness requires singular 'period' values like 'hour', 'day', 'minute'. Plural forms ('hours', 'days') are invalid and cause errors.
  2. Step 2: Check the YAML periods

    'period: hours' and 'period: days' use plural, which dbt does not recognize.
  3. Step 3: Rule out other options

    Option A: the counts are logical (warn at 12 hours, error at 1 day, i.e. 24 hours). Option B: the indentation is correct. Option C: incorrect, because the error_after window must be longer than the warn_after window, not shorter.
  4. Final Answer:

    The 'period' values must be singular strings -> Option D
  5. Quick Check:

    period: hour/day (singular only) [OK]
Hint: dbt freshness periods must be singular (hour, day) [OK]
Common Mistakes:
  • Using plural periods ('hours', 'days')
  • Incorrect YAML indentation
  • Thinking error_after time should be shorter than warn_after
5. You want to add a test to ensure the 'email' column in the 'users' table source is never null. Which YAML snippet correctly adds this test?
hard
A. sources: - name: app_data tables: - name: users columns: - name: email tests: - not_null
B. sources: - name: app_data tables: - name: users tests: - column: email test: not_null
C. sources: - name: app_data tables: - users: columns: - email: tests: - not_null
D. sources: - name: app_data tables: - name: users columns: - email test: not_null

Solution

  1. Step 1: Recall correct test syntax in source YAML

    Tests are added under 'columns' with 'name' and a 'tests' list containing test names.
  2. Step 2: Check each option's structure

    Option A correctly uses a 'columns' list whose item has a 'name' and a 'tests' list containing 'not_null'.
  3. Step 3: Identify errors in other options

    Option B uses the wrong keys ('column' and 'test') inside a table-level 'tests' list, Option C has the wrong nesting ('users' and 'email' are used as keys instead of 'name' values), and Option D uses 'test' instead of 'tests' and omits the 'name' key.
  4. Final Answer:

    sources: - name: app_data tables: - name: users columns: - name: email tests: - not_null -> Option A
  5. Quick Check:

    Tests under columns with 'tests' list = Option A [OK]
Hint: Tests go under columns with 'tests' list [OK]
Common Mistakes:
  • Using 'test' instead of 'tests'
  • Wrong nesting of columns and tests
  • Misnaming keys like 'column' instead of 'name'