
source() function for raw tables in dbt - Deep Dive


Overview - source() function for raw tables

What is it?

The source() function in dbt lets you refer to raw tables that already exist in your database but are not created by your dbt project. It tells dbt where your original data lives before you transform it, creating a clear link between the raw data and the models you build on top of it. Declaring sources also lets you document and test these raw tables.

Why it matters

Without the source() function, it would be hard to track where your raw data comes from and how it flows through your transformations. This can lead to confusion, errors, and difficulty in debugging. Using source() makes your data pipeline more transparent and reliable, which is crucial for making trustworthy decisions based on data.

Where it fits

Before learning source(), you should understand basic dbt models and how dbt runs SQL transformations. After mastering source(), you can learn about dbt tests, documentation, and advanced data lineage tracking.

Mental Model

Core Idea

source() acts like a named pointer that connects your dbt models to the original raw tables outside your project.

Think of it like...

Imagine source() as a street address written on a package label. It tells the delivery person exactly where to pick up the package (raw data) before it gets processed and sent out (transformed).

┌───────────────┐       ┌─────────────────┐
│ Raw Data Table│──────▶│ source() in dbt │
└───────────────┘       └─────────────────┘
                                 │
                                 ▼
                         ┌───────────────┐
                         │ Transformed   │
                         │ dbt Models    │
                         └───────────────┘

Build-Up - 7 Steps

Foundation: Understanding Raw Tables in dbt

Concept: Raw tables are the original data tables that exist in your database before any transformations.

Raw tables are usually created by data ingestion processes and contain unprocessed data. In dbt, these tables are not created by your project but are the starting point for your transformations. You need to know their names and locations to use them.

Result

You recognize raw tables as the source of truth for your data transformations.

Understanding raw tables is essential because they are the foundation of your entire data pipeline.

Foundation: Basic dbt Model References

Intermediate: Introducing source() for Raw Tables

Intermediate: Defining Sources in YAML Files

Intermediate: Using source() in SQL Models

Advanced: Benefits of source() for Testing and Documentation

Expert: Advanced Source Configuration and Overriding

Under the Hood

source() is a Jinja function that dbt resolves to the fully qualified name of a raw table, using the source and table names defined in YAML. At compile time, dbt replaces each source() call with the matching database, schema, and table name, so the compiled SQL points at the right raw data. The same call registers a dependency between the model and the source, which dbt uses for lineage and for generating documentation and test metadata.
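As a rough illustration of that resolution, here is a minimal source definition. The file path, the database `analytics`, the schema `raw_schema`, and the physical table name `users_v2` are all made-up names for this sketch, not values taken from a real project.

```yaml
# models/staging/sources.yml (hypothetical path; any .yml file in a model directory can hold sources)
version: 2

sources:
  - name: raw_data            # first argument to source()
    database: analytics       # assumed name; if omitted, dbt uses the target database
    schema: raw_schema        # assumed name; if omitted, dbt uses the source name as the schema
    tables:
      - name: users           # second argument to source()
        identifier: users_v2  # assumed physical table name; if omitted, dbt uses `name`

# With this definition, {{ source('raw_data', 'users') }} should compile to a
# relation along the lines of analytics.raw_schema.users_v2
# (the exact quoting depends on the adapter).
```

Keeping the logical name (`users`) separate from the physical one (`identifier`) means a renamed raw table only needs a YAML change, not a change in every model that selects from it.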

Why designed this way?

source() was designed to separate raw data definitions from transformation logic, improving clarity and maintainability. Before source(), users hardcoded raw table names, which was error-prone and made lineage tracking difficult. The YAML-based source definitions allow centralized metadata management and enable dbt to automate testing and documentation, which were manual and inconsistent before.

┌───────────────┐
│ YAML Source   │
│ Definitions   │
│ (source_name, │
│ schema, table)│
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ dbt Compile   │
│ (replace      │
│ source() with │
│ full table    │
│ path)         │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ SQL Query     │
│ with raw      │
│ table refs    │
└───────────────┘

Myth Busters - 4 Common Misconceptions

Quick: do you think source() creates new tables in your database? Commit yes or no.

Common Belief: source() creates or copies raw tables inside dbt.

Reality: source() only references tables that already exist in the database; dbt does not create or copy them.

Quick: do you think source() can be used without defining sources in YAML? Commit yes or no.

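Common Belief: You can use source() without declaring sources in YAML files.

Reality: A source must be declared in a YAML file first; calling source() with an undeclared source causes a compilation error.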

Quick: do you think source() references are fixed and cannot change per environment? Commit yes or no.

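Common Belief: source() references are static and cannot be overridden per environment.

Reality: Source definitions can be overridden per environment, for example by setting the schema through variables or the active profile.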

Quick: do you think source() automatically tests raw data quality? Commit yes or no.

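Common Belief: Using source() automatically guarantees raw data quality without extra setup.

Reality: source() only makes raw tables testable; tests and freshness checks must be defined in the source YAML and run explicitly.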

Expert Zone

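Source freshness checks rely on freshness settings in the source YAML (a loaded_at_field timestamp column plus warn_after and error_after thresholds) and only run when the dbt source freshness command is invoked, so they need to be scheduled in dbt Cloud or an orchestration tool to be effective.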
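A freshness configuration might look roughly like the sketch below. The column name `_loaded_at` and the 12 and 24 hour thresholds are assumptions chosen for illustration, and the exact placement of these keys can differ between dbt versions.

```yaml
version: 2

sources:
  - name: raw_data
    schema: raw_schema               # assumed schema name
    loaded_at_field: _loaded_at      # assumed timestamp column written by the ingestion tool
    freshness:
      warn_after: {count: 12, period: hour}   # warn if the newest row is older than 12 hours
      error_after: {count: 24, period: hour}  # fail if the newest row is older than 24 hours
    tables:
      - name: users
```

Running `dbt source freshness` compares the latest `loaded_at_field` value in each table against these thresholds. Nothing is checked until that command runs, which is why scheduling matters.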

Using source() improves lineage tracking but requires consistent naming conventions and source definitions to avoid confusion in large projects.

Overriding sources per environment can introduce subtle bugs if not carefully managed, especially when schemas differ between dev and prod.

When NOT to use

Do not use source() for anything your dbt project builds itself, such as models (including ephemeral ones), seeds, or snapshots; use ref() for those. source() also does not apply to raw data that lives only in files or behind APIs rather than in the database; load that data into the warehouse first with external tools or custom macros.

Production Patterns

In production, teams define all raw data sources in YAML with detailed metadata and tests. They use source() in models to ensure clear lineage and run automated freshness checks. Overriding sources per environment allows seamless deployment from development to production without changing SQL code.
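One possible shape for such a definition, with descriptions and column tests attached to a source table, is sketched here. The column name `id` and the description text are assumptions for the example.

```yaml
version: 2

sources:
  - name: raw_data
    description: Tables loaded by the ingestion pipeline.
    schema: raw_schema            # assumed schema name
    tables:
      - name: users
        description: One row per registered user.
        columns:
          - name: id              # assumed column name
            description: Primary key of the users table.
            tests:                # newer dbt versions also accept the key `data_tests`
              - unique
              - not_null
```

The tests run with `dbt test`, and the descriptions appear in the generated documentation next to every model that selects from this source.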

Connections

Data Lineage

source() builds the foundation for tracking data lineage by linking raw tables to transformations.

Understanding source() helps grasp how data flows and dependencies are tracked in complex pipelines.

Software Dependency Injection

source() acts like dependency injection by decoupling raw data definitions from transformation logic.

Knowing this connection clarifies why separating configuration (YAML) from code (SQL) improves flexibility and maintainability.

Supply Chain Management

source() is like identifying raw material suppliers in a supply chain before manufacturing products.

Recognizing this analogy helps appreciate the importance of clear source definitions for quality control and traceability.

Common Pitfalls

#1 Referencing raw tables directly in SQL without source()

Wrong approach: select * from raw_schema.users

Correct approach: select * from {{ source('raw_data', 'users') }}

Root cause: A hardcoded table name is invisible to dbt, so the raw table is missing from the dependency graph and gets none of the documentation or testing metadata.

#2 Using source() without defining the source in YAML

Wrong approach: select * from {{ source('unknown_source', 'users') }}

Correct approach: Define the source in YAML before using source() in SQL.

Root cause: Missing source definitions cause compilation errors.

#3 Hardcoding schema names inside source YAML without environment overrides

Wrong approach: schema: raw_schema_prod (fixed in YAML)

Correct approach: Set the schema in the source YAML from a variable or from the active profile's target, so it can differ per environment.

Root cause: Fixed schemas reduce flexibility and cause deployment issues.
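One way to do this, sketched under the assumption that the profile defines a target named `prod` and that the development schema is called `raw_schema_dev` (both names are hypothetical), is to compute the schema with Jinja inside the source YAML:

```yaml
version: 2

sources:
  - name: raw_data
    # Pick the schema from the active target; 'prod' and both schema names are assumed.
    schema: "{{ 'raw_schema_prod' if target.name == 'prod' else 'raw_schema_dev' }}"
    tables:
      - name: users

# Alternative using a project variable (the variable name raw_schema is an assumption):
#   schema: "{{ var('raw_schema', 'raw_schema_dev') }}"
# which could then be overridden at run time with
#   dbt run --vars '{raw_schema: raw_schema_prod}'
```

Models keep calling `{{ source('raw_data', 'users') }}` unchanged; only the resolved schema differs between environments.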

Key Takeaways

The source() function in dbt connects your models to raw tables by referencing them through YAML-defined sources.

Using source() improves clarity, documentation, testing, and lineage tracking in your data pipeline.

Source definitions must be declared in YAML files before using source() in SQL models.

Advanced source features like freshness checks and environment overrides help build robust production pipelines.

Avoid common mistakes like hardcoding raw table names or skipping source definitions to maintain project health.

Practice

1. What is the main purpose of the source() function in dbt? (easy)

A. To create new tables in the database

B. To run Python scripts inside dbt models

C. To delete raw tables from the database

D. To reference raw tables defined in the sources.yml file
