
Configuring sources in YAML in dbt - Mechanics & Internals

Overview - Configuring sources in YAML
What is it?
Configuring sources in YAML means defining where your raw data lives using a simple text format called YAML. In dbt, sources tell your project about external tables or files you want to use. This setup helps dbt understand and manage your data dependencies clearly. It’s like giving dbt a map to find your data before transforming it.
Why it matters
Without configuring sources, dbt wouldn’t know where to find the original data to work with. This would make data transformations unreliable and hard to maintain. By defining sources, you create a clear, reusable, and documented connection to your raw data, which improves data quality and team collaboration.
Where it fits
Before learning this, you should understand basic YAML syntax and dbt project structure. After mastering source configuration, you can move on to writing models that transform data and testing data quality using dbt.
Mental Model
Core Idea
Configuring sources in YAML is like giving dbt a clear address book to find and trust your raw data before transforming it.
Think of it like...
Imagine you want to bake a cake. Configuring sources is like writing down the exact grocery store and aisle where you get your ingredients. Without this, you might waste time searching or use wrong ingredients.
┌───────────────┐
│  dbt Project  │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│  sources.yml  │
│ - name: raw_data_source
│   tables:
│    - name: customers
│    - name: orders
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Raw Data Tables│
│ customers     │
│ orders        │
└───────────────┘
Build-Up - 7 Steps
1. Foundation: Understanding YAML Basics
Concept: Learn the simple structure and syntax of YAML used to write configuration files.
YAML is a human-friendly way to write data. It uses indentation to show hierarchy. For example:

    sources:
      - name: my_source
        tables:
          - name: my_table

This means 'sources' is a list with one item named 'my_source', which has a list of tables including 'my_table'.
Result
You can read and write YAML files that dbt uses to configure sources.
Understanding YAML is essential because dbt uses it to define sources clearly and simply.
2. Foundation: What Are Sources in dbt?
Concept: Sources tell dbt about external raw data tables before any transformation.
In dbt, a source is a reference to a table that already exists in your data warehouse but is not built by your dbt models, typically raw data loaded there by another tool. You define sources in YAML files under the 'sources' key. Each source has a name and a list of tables it contains. This helps dbt know where to find raw data.
Result
You can create a YAML file that lists your raw data tables as sources.
Knowing sources separates raw data from transformed data, making your project organized and reliable.
3. Intermediate: Writing a Source Configuration File
Before reading on: do you think a source config needs only table names, or also database and schema info? Commit to your answer.
Concept: Source configuration includes database, schema, and table names to fully locate data.
A typical source config looks like this:

    sources:
      - name: raw_data
        database: analytics_db
        schema: public
        tables:
          - name: customers
          - name: orders

This tells dbt the exact location of each table.
Result
dbt knows exactly where to find each source table in your database.
Including database and schema ensures dbt connects to the right place, avoiding confusion in complex environments.
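When database or schema is left out, dbt falls back to defaults: the database of the active target and a schema equal to the source name. The sketch below illustrates those defaults together with the identifier property, which is useful when the physical table name differs from the name you want to use in models. The table name orders_v2 is a hypothetical example, not something from this lesson.

```yaml
sources:
  - name: raw_data              # with no `schema:` key, dbt uses "raw_data" as the schema
    tables:
      - name: customers         # resolves to <target database>.raw_data.customers
      - name: orders
        identifier: orders_v2   # hypothetical physical table name in the warehouse
        # models still reference this table as source('raw_data', 'orders')
```

Spelling out database and schema, as in the example above, is usually the safer choice when several teams or warehouses are involved.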
4. Intermediate: Referencing Sources in Models
Before reading on: do you think you write raw SQL table names directly or use a special function to refer to sources in dbt models? Commit to your answer.
Concept: dbt uses the source() function to refer to configured sources in SQL models.
Instead of writing raw table names, you use source() like this:

    select * from {{ source('raw_data', 'customers') }}

This tells dbt to use the 'customers' table from the 'raw_data' source defined in YAML.
Result
Your models dynamically link to the correct source tables, improving maintainability.
Using source() creates a clear connection between your models and source configs, making refactoring safer.
5. Intermediate: Adding Freshness and Description Metadata
Concept: You can add metadata like freshness checks and descriptions to sources for better data quality and documentation.
In your YAML, add freshness and description:

    sources:
      - name: raw_data
        description: 'Raw data from production'
        tables:
          - name: customers
            description: 'Customer details'
            freshness:
              warn_after:
                count: 12
                period: hour
              error_after:
                count: 24
                period: hour

This helps dbt check if data is updated on time and documents it.
Result
dbt can warn or error if source data is stale, and your team understands data purpose better.
Metadata improves trust and communication about your data sources.
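To compare against the warn_after and error_after thresholds, dbt generally needs to know which timestamp column records when a row was loaded. That is what loaded_at_field is for. The following is a minimal sketch, assuming a hypothetical _loaded_at column on each table. Some adapters and newer dbt versions can read freshness from warehouse metadata or expect these keys under config instead, so check the documentation for your version.

```yaml
sources:
  - name: raw_data
    description: 'Raw data from production'
    loaded_at_field: _loaded_at      # hypothetical timestamp column, inherited by every table below
    freshness:                       # source-level default for all tables
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: customers
      - name: orders
        freshness: null              # opt this table out of freshness checks
```

Setting the rule once at the source level keeps the thresholds consistent, while individual tables can still override or disable it.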
6. Advanced: Using Source Configs for Testing and Documentation
Before reading on: do you think source configs only help find data, or can they also be used for tests and docs? Commit to your answer.
Concept: Source configurations enable automated tests and documentation generation in dbt.
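You can define tests such as uniqueness or not-null on the columns of a source table:

    sources:
      - name: raw_data
        tables:
          - name: customers
            columns:
              - name: customer_id   # hypothetical column name
                tests:
                  - unique
                  - not_null

The unique and not_null tests apply to a single column, so they are listed under 'columns' rather than directly under the table. Running dbt test checks these automatically. Also, dbt docs generate builds documentation pages from source descriptions.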
Result
Your raw data is automatically checked for quality and well documented.
Leveraging source configs for tests and docs raises data reliability and team confidence.
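Besides unique and not_null, dbt ships two other built-in generic tests, accepted_values and relationships, and both can be attached to source columns. The sketch below is illustrative only: the status and customer_id columns and the list of allowed values are assumptions, not part of this lesson's data. Newer dbt releases also accept data_tests as the key name in place of tests.

```yaml
sources:
  - name: raw_data
    tables:
      - name: customers
      - name: orders
        columns:
          - name: status                   # hypothetical column
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'returned']   # hypothetical allowed values
          - name: customer_id              # hypothetical foreign key column
            tests:
              - not_null
              - relationships:             # every orders.customer_id must exist in customers
                  to: source('raw_data', 'customers')
                  field: customer_id
```

Testing at the source catches bad raw data before any model consumes it, which tends to make failures easier to trace.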
7. Expert: Managing Multiple Environments with Source Overrides
Before reading on: do you think source configs are fixed, or can they change per environment? Commit to your answer.
Concept: dbt allows overriding source configurations per environment to handle different databases or schemas.
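You can use 'vars' or 'target' in dbt to change source configs dynamically:

    # in dbt_project.yml
    vars:
      source_schema: 'dev_schema'

    # in sources.yml
    sources:
      - name: raw_data
        schema: "{{ var('source_schema') }}"

Note the double quotes around the Jinja expression: it contains single quotes, so wrapping it in single quotes would break the YAML. The value in dbt_project.yml acts as a default, and you can override it per run, for example with dbt run --vars '{source_schema: prod_schema}'. This lets you run the same project in dev, test, or prod with different source locations.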
Result
Your dbt project adapts to multiple environments without changing source YAML files manually.
Dynamic source configs enable smooth deployment workflows and reduce errors across environments.
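The target-based variant mentioned above can be sketched as follows. It assumes your profiles.yml defines a target named prod; that name and the fallback schema are assumptions for illustration.

```yaml
sources:
  - name: raw_data
    database: analytics_db
    # Use the production schema only when running against the prod target;
    # every other target (dev, ci, ...) reads from dev_schema.
    schema: "{{ 'public' if target.name == 'prod' else 'dev_schema' }}"
    tables:
      - name: customers
      - name: orders
```

Compared with vars, this approach needs no extra command-line flags, because the schema follows whichever target the run selects.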
Under the Hood
When dbt runs, it first parses the YAML source files to build a catalog of raw data tables with their full database and schema paths (recorded in the project manifest). The source() function in models uses this catalog to generate correct SQL references and to register the dependency between the model and the source. Tests and freshness rules are stored alongside the source definitions and are executed by their own commands: 'dbt test' runs the tests, and 'dbt source freshness' runs the freshness checks. This separation allows dbt to manage dependencies and documentation systematically.
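One way to picture this is to annotate a single source definition with the part of dbt that consumes each key. This is a rough sketch rather than a description of dbt's internal code; the _loaded_at and customer_id columns are hypothetical, and the exact quoting of the compiled relation depends on your adapter.

```yaml
sources:
  - name: raw_data
    database: analytics_db
    schema: public
    description: 'Raw data from production'   # rendered by `dbt docs generate`
    loaded_at_field: _loaded_at                # hypothetical column, read by `dbt source freshness`
    freshness:                                 # evaluated only by `dbt source freshness`
      warn_after: {count: 12, period: hour}
    tables:
      - name: customers
        # {{ source('raw_data', 'customers') }} compiles to a relation
        # such as analytics_db.public.customers
        columns:
          - name: customer_id                  # hypothetical column
            tests:                             # executed by `dbt test`
              - not_null
```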
Why designed this way?
dbt was designed to separate raw data definitions from transformations to improve clarity and maintainability. Using YAML for sources provides a simple, readable format that is easy to version control. The design supports modularity, allowing teams to update source info without touching transformation code, and supports automation of tests and docs.
┌───────────────┐
│ sources.yml   │
│ (YAML config) │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ dbt Catalog   │
│ (source info) │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ SQL Models    │
│ source() refs │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Database      │
│ Raw Tables    │
└───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Do you think source configurations automatically create tables in your database? Commit to yes or no.
Common Belief: Configuring sources in YAML creates the actual tables in the database.
Reality: Source configs only tell dbt where existing tables are; they do not create or modify tables.
Why it matters: Thinking sources create tables can lead to confusion and errors when tables are missing or not updated.
Quick: Do you think you can reference a source table in dbt without defining it in YAML? Commit to yes or no.
Common Belief: You can use any table name directly in models without defining it as a source in YAML.
Reality: To use the source() function and get benefits like testing and docs, the source must be defined in YAML first.
Why it matters: Skipping source definitions loses lineage, testing, and documentation for that table, and any source() call to an undefined source fails to compile.
Quick: Do you think source freshness checks run automatically every time you run dbt models? Commit to yes or no.
Common Belief: Source freshness checks run automatically with every dbt run.
Reality: Freshness checks run only when you explicitly run 'dbt source freshness'.
Why it matters: Assuming automatic freshness checks can cause unnoticed stale data issues.
Quick: Do you think source configurations are fixed and cannot be changed per environment? Commit to yes or no.
Common Belief: Source YAML files are static and cannot adapt to different environments like dev or prod.
Reality: Source configs can use variables and Jinja templating to change per environment dynamically.
Why it matters: Not knowing this limits flexibility and causes manual errors when deploying across environments.
Expert Zone
1. Source configurations can include quoting rules to handle case sensitivity or reserved words in different databases.
2. You can define multiple sources in one YAML file or split them across files for better organization in large projects.
3. Using source freshness metadata effectively requires understanding your data update patterns and scheduling dbt runs accordingly.
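The quoting rules from the first point can be sketched like this. The quoting block accepts database, schema, and identifier flags at the source level, and a table can override them. Which settings you actually need depends on your warehouse, so treat the values below as an illustration only.

```yaml
sources:
  - name: raw_data
    database: analytics_db
    schema: public
    quoting:
      database: false
      schema: false
      identifier: true        # quote table names, e.g. to preserve case
    tables:
      - name: customers
      - name: orders
        quoting:
          identifier: false   # this table overrides the source-level setting
```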
When NOT to use
If your data source is not a table but an API or streaming data, YAML source configs in dbt are not suitable. Instead, use external tools or custom integrations. Also, for very simple projects with no raw data dependencies, explicit source configs may be unnecessary.
Production Patterns
In production, teams use source configs to enforce data contracts, run automated tests on raw data, and generate documentation for data consumers. They also use environment variables and variables in YAML to manage multiple deployment targets seamlessly.
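A minimal sketch of the environment-variable pattern uses dbt's env_var() function with a fallback value. The variable names DBT_SOURCE_DATABASE and DBT_SOURCE_SCHEMA are hypothetical; a deployment would set whatever names its scheduler or CI system provides.

```yaml
sources:
  - name: raw_data
    # env_var(name, default): the second argument is used when the variable is unset
    database: "{{ env_var('DBT_SOURCE_DATABASE', 'analytics_db') }}"
    schema: "{{ env_var('DBT_SOURCE_SCHEMA', 'public') }}"
    tables:
      - name: customers
      - name: orders
```

Keeping the defaults equal to the development values means a local run works without any extra setup, while production jobs inject their own locations.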
Connections
Data Cataloging
Configuring sources in YAML builds a simple form of data cataloging by documenting raw data locations and metadata.
Understanding source configs helps grasp how data catalogs organize and document data assets for discovery and governance.
Infrastructure as Code (IaC)
Both use declarative configuration files (like YAML) to define infrastructure or data setups.
Knowing source YAML configs in dbt parallels IaC concepts, showing how declarative files improve reproducibility and version control.
Supply Chain Management
Source configuration is like managing suppliers and raw materials before manufacturing products.
Seeing data sources as suppliers clarifies the importance of tracking and validating inputs before processing.
Common Pitfalls
#1 Forgetting to define a source in YAML but trying to use source() in models.
Wrong approach: select * from {{ source('raw_data', 'missing_table') }}
Correct approach: Define 'missing_table' under 'raw_data' in sources.yml before referencing it.
Root cause: Assuming source() works without a prior YAML definition; dbt raises a compilation error when it cannot find the source.
#2 Hardcoding database and schema names in SQL models instead of using source configs.
Wrong approach: select * from analytics_db.public.customers
Correct approach: select * from {{ source('raw_data', 'customers') }}
Root cause: Hardcoding breaks portability and maintainability across environments.
#3 Misindenting YAML, causing dbt to fail parsing source configs.
Wrong approach:

    sources:
      - name: raw_data
      tables:
        - name: customers

Correct approach:

    sources:
      - name: raw_data
        tables:
          - name: customers

Root cause: YAML is sensitive to indentation; 'tables' must be indented to line up with 'name' so that it belongs to the same source entry.
Key Takeaways
Configuring sources in YAML tells dbt where raw data lives, separating raw inputs from transformations.
Using source() in models links your SQL to these configurations, improving clarity and maintainability.
Adding metadata like freshness and tests in source configs boosts data quality and documentation.
Source configs can be dynamic per environment, enabling smooth deployments across dev, test, and prod.
Understanding source configuration is foundational for building reliable, scalable dbt projects.