Configuring sources in YAML in dbt - Mechanics & Internals

Overview - Configuring sources in YAML

What is it?

Configuring sources in YAML means declaring where your raw data lives using a simple text format called YAML. In dbt, sources describe tables that already exist in your warehouse, usually loaded by another tool, that your models read from. This setup lets dbt track and manage your data dependencies explicitly. It's like giving dbt a map to find your data before transforming it.

Why it matters

Without configured sources, models have to hardcode the names of raw tables, and dbt has no record of where the original data comes from. That makes transformations harder to maintain and leaves raw data untested and undocumented. By defining sources, you create a clear, reusable, and documented connection to your raw data, which improves data quality and team collaboration.

Where it fits

Before learning this, you should understand basic YAML syntax and dbt project structure. After mastering source configuration, you can move on to writing models that transform data and testing data quality using dbt.

Mental Model

Core Idea

Configuring sources in YAML is like giving dbt a clear address book to find and trust your raw data before transforming it.

Think of it like...

Imagine you want to bake a cake. Configuring sources is like writing down the exact grocery store and aisle where you get your ingredients. Without this, you might waste time searching or use the wrong ingredients.

┌───────────────┐
│  dbt Project  │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│  sources.yml  │
│ - name: raw_data_source
│   tables:
│    - name: customers
│    - name: orders
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Raw Data Tables│
│ customers     │
│ orders        │
└───────────────┘

Build-Up - 7 Steps

Foundation: Understanding YAML Basics

Concept: Learn the simple structure and syntax of YAML used to write configuration files.

YAML is a human-friendly way to write data. It uses indentation to show hierarchy. For example:

    sources:
      - name: my_source
        tables:
          - name: my_table

This means 'sources' is a list with one item named 'my_source', which has a list of tables including 'my_table'.

Result

You can read and write YAML files that dbt uses to configure sources.

Understanding YAML is essential because dbt uses it to define sources clearly and simply.

Foundation: What Are Sources in dbt?

Intermediate: Writing a Source Configuration File

Intermediate: Referencing Sources in Models

Intermediate: Adding Freshness and Description Metadata

Advanced: Using Source Configs for Testing and Documentation

Expert: Managing Multiple Environments with Source Overrides

Under the Hood

When dbt parses a project, it reads the YAML source definitions and builds a catalog of raw data tables with their full database and schema paths. The source() function in models uses this catalog to generate correct SQL references and to register the dependency. Freshness rules and tests are stored alongside these definitions: tests execute with dbt test or dbt build, and freshness checks execute with dbt source freshness. This separation allows dbt to manage dependencies and documentation systematically.
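As a rough sketch of that lookup, the fragment below assumes the raw tables live in a database called analytics_db and a schema called public (the names that appear in the pitfalls later on). The file path in the comment is only a common convention, not something dbt requires.

```yaml
# models/sources.yml (illustrative path)
sources:
  - name: raw_data            # first argument to source()
    database: analytics_db    # assumed; falls back to the target database if omitted
    schema: public            # assumed; falls back to the source name if omitted
    tables:
      - name: customers       # second argument to source()
      - name: orders
```

With a definition like this, {{ source('raw_data', 'customers') }} in a model would compile to a reference along the lines of analytics_db.public.customers, with the exact quoting depending on the adapter.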

Why designed this way?

dbt was designed to separate raw data definitions from transformations to improve clarity and maintainability. Using YAML for sources provides a simple, readable format that is easy to version control. The design supports modularity, allowing teams to update source info without touching transformation code, and supports automation of tests and docs.

┌───────────────┐
│ sources.yml   │
│ (YAML config) │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ dbt Catalog   │
│ (source info) │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ SQL Models    │
│ source() refs │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Database      │
│ Raw Tables    │
└───────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Do you think source configurations automatically create tables in your database? Commit to yes or no.

Common Belief: Configuring sources in YAML creates the actual tables in the database.

Reality: A source definition only declares tables that already exist. dbt does not create or load them.

Quick: Do you think you can reference a source table in dbt without defining it in YAML? Commit to yes or no.

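Common Belief: You can use any table name directly in models without defining it as a source in YAML.

Reality: Plain SQL can name any table directly, but dbt then has no record of the dependency. The source() function only resolves tables that are defined in YAML.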

Quick: Do you think source freshness checks run automatically every time you run dbt models? Commit to yes or no.

Common Belief: Source freshness checks run automatically with every dbt run.

Reality: Freshness checks run only when you invoke the dbt source freshness command. dbt run does not execute them.

Quick: Do you think source configurations are fixed and cannot be changed per environment? Commit to yes or no.

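Common Belief: Source YAML files are static and cannot adapt to different environments like dev or prod.

Reality: Source properties such as database and schema accept Jinja, so their values can come from the active target, project variables, or environment variables.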

Expert Zone

Source configurations can include quoting rules to handle case sensitivity or reserved words in different databases.

You can define multiple sources in one YAML file or split them across files for better organization in large projects.

Using source freshness metadata effectively requires understanding your data update patterns and scheduling dbt runs accordingly.
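The three points above can be sketched in a single file. This is a minimal illustration rather than a recommended setup: the second source, the _loaded_at column, and the freshness thresholds are all assumptions chosen for the example.

```yaml
sources:
  # First source: quote table identifiers, e.g. for mixed-case names
  # or names that collide with reserved words.
  - name: raw_data
    quoting:
      identifier: true
    tables:
      - name: customers
      - name: orders

  # Second source in the same file, with freshness metadata.
  - name: app_data
    loaded_at_field: _loaded_at        # hypothetical load-timestamp column
    freshness:
      warn_after: {count: 12, period: hour}    # assumed thresholds
      error_after: {count: 24, period: hour}
    tables:
      - name: users
```

The thresholds only make sense relative to how often the data is actually loaded, and they are evaluated when dbt source freshness is invoked. Depending on the dbt version, freshness and loaded_at_field may instead be expected under a config block, so it is worth checking the documentation for the version in use.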

When NOT to use

dbt sources describe tables or views that already exist in your warehouse. If your data comes from an API or a stream, a YAML source config cannot pull it in; load it into the warehouse with an external tool or custom integration first. Also, for very simple projects with no raw data dependencies, explicit source configs may be unnecessary.

Production Patterns

In production, teams use source configs to enforce data contracts, run automated tests on raw data, and generate documentation for data consumers. They also use environment variables and variables in YAML to manage multiple deployment targets seamlessly.
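A sketch of how these patterns can combine in one source definition follows. The environment variable RAW_DATABASE, the project variable raw_schema, their default values, and the customer_id column are hypothetical names introduced only for illustration.

```yaml
sources:
  - name: raw_data
    description: Raw tables loaded from the application database.
    # Resolved per environment; the second argument is a fallback default.
    database: "{{ env_var('RAW_DATABASE', 'analytics_db') }}"
    schema: "{{ var('raw_schema', 'public') }}"
    tables:
      - name: customers
        description: One row per customer.
        columns:
          - name: customer_id          # hypothetical key column
            description: Primary key of the customers table.
            tests:
              - unique
              - not_null
```

Here the descriptions feed the generated documentation, the column tests act as a lightweight contract on the raw data, and the Jinja expressions let the same file point at different databases or schemas per deployment target.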

Connections

Data Cataloging

Configuring sources in YAML builds a simple form of data cataloging by documenting raw data locations and metadata.

Understanding source configs helps grasp how data catalogs organize and document data assets for discovery and governance.

Infrastructure as Code (IaC)

Both use declarative configuration files (like YAML) to define infrastructure or data setups.

Knowing source YAML configs in dbt parallels IaC concepts, showing how declarative files improve reproducibility and version control.

Supply Chain Management

Source configuration is like managing suppliers and raw materials before manufacturing products.

Seeing data sources as suppliers clarifies the importance of tracking and validating inputs before processing.

Common Pitfalls

#1 Forgetting to define a source in YAML but trying to use source() in models.

Wrong approach: select * from {{ source('raw_data', 'missing_table') }}

Correct approach: Define 'missing_table' under 'raw_data' in sources.yml before referencing it.

Root cause: source() only resolves tables declared in YAML, so an undeclared reference fails when dbt compiles the project.
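A minimal sketch of the fix, assuming a table named missing_table really exists in the warehouse location that raw_data points to:

```yaml
sources:
  - name: raw_data
    tables:
      - name: customers
      - name: missing_table   # now source('raw_data', 'missing_table') can resolve
```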

#2 Hardcoding database and schema names in SQL models instead of using source configs.

Wrong approach: select * from analytics_db.public.customers

Correct approach: select * from {{ source('raw_data', 'customers') }}

Root cause: Hardcoding breaks portability and maintainability across environments.

#3 Misindenting YAML, causing dbt to fail parsing source configs.

Wrong approach:

    sources:
    - name: raw_data
    tables:
    - name: customers

Correct approach:

    sources:
      - name: raw_data
        tables:
          - name: customers

Root cause: YAML is sensitive to indentation; incorrect spacing changes the structure. Here 'tables' must be indented under its source entry rather than sitting at the top level.

Key Takeaways

Configuring sources in YAML tells dbt where raw data lives, separating raw inputs from transformations.

Using source() in models links your SQL to these configurations, improving clarity and maintainability.

Adding metadata like freshness and tests in source configs boosts data quality and documentation.

Source configs can be dynamic per environment, enabling smooth deployments across dev, test, and prod.

Understanding source configuration is foundational for building reliable, scalable dbt projects.

Practice

1. What is the main purpose of configuring sources in a dbt YAML file?

easy

A. To write SQL queries for data transformation

B. To tell dbt where to find raw data tables

C. To create dashboards for data visualization

D. To schedule dbt runs automatically

5. You want to add a test to ensure the 'email' column in the 'users' table source is never null. Which YAML snippet correctly adds this test?

hard

A.

    sources:
      - name: app_data
        tables:
          - name: users
            columns:
              - name: email
                tests:
                  - not_null

B.

    sources:
      - name: app_data
        tables:
          - name: users
            tests:
              - column: email
                test: not_null

C.

    sources:
      - name: app_data
        tables:
          - users:
            columns:
              - email:
                tests:
                  - not_null

D.

    sources:
      - name: app_data
        tables:
          - name: users
            columns:
              - email
                test: not_null

Solution

Step 1: Understand the role of source configuration

Step 2: Differentiate from other dbt tasks

Final Answer:

Quick Check:

Solution

Step 1: Recall correct YAML source structure

Step 2: Compare options to syntax

Final Answer:

Quick Check:

Solution

Step 1: Locate the 'loaded_at_field' key in YAML

Step 2: Identify the value assigned

Final Answer:

Quick Check:

Solution

Step 1: Understand dbt freshness period syntax

Step 2: Check the YAML periods

Step 3: Rule out other options

Final Answer:

Quick Check:

Solution

Step 1: Recall correct test syntax in source YAML

Step 2: Check each option's structure

Step 3: Identify errors in other options

Final Answer:

Quick Check: