0
0
dbtdata~15 mins

source() function for raw tables in dbt - Deep Dive

Choose your learning style9 modes available
Overview - source() function for raw tables
What is it?
The source() function in dbt is a way to refer to raw tables that exist outside of your dbt project. It helps you tell dbt where your original data lives before you transform it. This function creates a clear link between your raw data and the models you build on top of it. It also helps with documentation and testing of these raw tables.
Why it matters
Without the source() function, it would be hard to track where your raw data comes from and how it flows through your transformations. This can lead to confusion, errors, and difficulty in debugging. Using source() makes your data pipeline more transparent and reliable, which is crucial for making trustworthy decisions based on data.
Where it fits
Before learning source(), you should understand basic dbt models and how dbt runs SQL transformations. After mastering source(), you can learn about dbt tests, documentation, and advanced data lineage tracking.
Mental Model
Core Idea
source() acts like a named pointer that connects your dbt models to the original raw tables outside your project.
Think of it like...
Imagine source() as a street address written on a package label. It tells the delivery person exactly where to pick up the package (raw data) before it gets processed and sent out (transformed).
┌───────────────┐       ┌───────────────┐
│ Raw Data Table│──────▶│ source() in dbt│
└───────────────┘       └───────────────┘
                              │
                              ▼
                      ┌───────────────┐
                      │ Transformed   │
                      │ dbt Models    │
                      └───────────────┘
Build-Up - 7 Steps
1
FoundationUnderstanding Raw Tables in dbt
🤔
Concept: Raw tables are the original data tables that exist in your database before any transformations.
Raw tables are usually created by data ingestion processes and contain unprocessed data. In dbt, these tables are not created by your project but are the starting point for your transformations. You need to know their names and locations to use them.
Result
You recognize raw tables as the source of truth for your data transformations.
Understanding raw tables is essential because they are the foundation of your entire data pipeline.
2
FoundationBasic dbt Model References
🤔
Concept: dbt models are SQL files that transform data, and you refer to them using ref() inside your project.
In dbt, you use ref('model_name') to refer to other models within your project. This creates dependencies and ensures models build in the right order. However, ref() only works for models created by dbt, not for raw tables.
Result
You can link your models together but cannot yet link to raw tables.
Knowing the limits of ref() prepares you to understand why source() is needed.
3
IntermediateIntroducing source() for Raw Tables
🤔Before reading on: do you think ref() can be used to refer to raw tables? Commit to your answer.
Concept: source() is a dbt function that lets you refer to raw tables outside your project by specifying their source name and table name.
You define sources in your dbt project YAML files with a name and schema. Then, in your SQL models, you use source('source_name', 'table_name') to refer to these raw tables. This tells dbt exactly where to find the raw data.
Result
You can now safely and clearly reference raw tables in your transformations.
Understanding source() bridges the gap between raw data and your dbt models, making your pipeline clearer.
4
IntermediateDefining Sources in YAML Files
🤔Before reading on: do you think source definitions are written in SQL or YAML files? Commit to your answer.
Concept: Sources are declared in YAML files to describe raw tables and their metadata.
In a YAML file, you write a source block with a name, schema, and tables. For example: sources: - name: raw_data schema: raw_schema tables: - name: users - name: orders This setup allows dbt to understand where raw tables live.
Result
Your project knows about raw tables and can use source() to refer to them.
Separating source definitions from SQL keeps your project organized and makes metadata management easier.
5
IntermediateUsing source() in SQL Models
🤔
Concept: You use source() inside your SQL models to query raw tables by their source and table names.
In your model SQL file, you write: select * from {{ source('raw_data', 'users') }} This tells dbt to replace the source() call with the full database path to the raw users table.
Result
Your model queries the raw users table correctly and builds on top of it.
Using source() in SQL makes your code readable and maintainable by clearly showing data origins.
6
AdvancedBenefits of source() for Testing and Documentation
🤔Before reading on: do you think source() affects only SQL code or also dbt tests and docs? Commit to your answer.
Concept: source() enables dbt to automatically generate documentation and run tests on raw tables.
When you define sources, dbt can create docs pages showing raw table details and run tests like uniqueness or not-null on source columns. This helps catch data quality issues early and improves transparency.
Result
Your project has better data quality checks and clear documentation for raw data.
Knowing that source() supports testing and docs shows its role beyond just SQL references.
7
ExpertAdvanced Source Configuration and Overriding
🤔Before reading on: do you think source() references can be dynamically changed per environment? Commit to your answer.
Concept: dbt allows overriding source configurations per environment and supports freshness checks to monitor raw data updates.
You can configure sources with freshness criteria to alert if raw data is stale. Also, source paths can be overridden in different environments (dev, prod) using dbt's profiles or variables. This flexibility helps manage complex pipelines.
Result
Your project adapts to different environments and monitors raw data health automatically.
Understanding source() advanced features helps build robust, production-ready data pipelines.
Under the Hood
The source() function is a Jinja macro in dbt that resolves to the fully qualified name of a raw table based on the source and table names defined in YAML. At compile time, dbt replaces source() calls with the correct database schema and table names, ensuring SQL queries point to the right raw data. This linking also enables dbt to track dependencies and generate metadata for documentation and testing.
Why designed this way?
source() was designed to separate raw data definitions from transformation logic, improving clarity and maintainability. Before source(), users hardcoded raw table names, which was error-prone and made lineage tracking difficult. The YAML-based source definitions allow centralized metadata management and enable dbt to automate testing and documentation, which were manual and inconsistent before.
┌───────────────┐
│ YAML Source   │
│ Definitions   │
│ (source_name, │
│ schema, table)│
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ dbt Compile   │
│ (replace      │
│ source() with │
│ full table    │
│ path)         │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ SQL Query     │
│ with raw      │
│ table refs    │
└───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: do you think source() creates new tables in your database? Commit yes or no.
Common Belief:source() creates or copies raw tables inside dbt.
Tap to reveal reality
Reality:source() only references existing raw tables; it does not create or modify them.
Why it matters:Thinking source() creates tables can lead to confusion about data lineage and cause errors when expecting new tables to appear.
Quick: do you think source() can be used without defining sources in YAML? Commit yes or no.
Common Belief:You can use source() without declaring sources in YAML files.
Tap to reveal reality
Reality:source() requires source definitions in YAML to work; otherwise, dbt will error during compilation.
Why it matters:Skipping YAML definitions breaks your project and prevents dbt from tracking raw data metadata.
Quick: do you think source() references are fixed and cannot change per environment? Commit yes or no.
Common Belief:source() references are static and cannot be overridden per environment.
Tap to reveal reality
Reality:dbt allows overriding source configurations per environment, enabling flexible raw table references.
Why it matters:Not knowing this limits your ability to manage multiple environments and can cause deployment issues.
Quick: do you think source() automatically tests raw data quality? Commit yes or no.
Common Belief:Using source() automatically guarantees raw data quality without extra setup.
Tap to reveal reality
Reality:source() enables testing but you must explicitly define tests on sources to check data quality.
Why it matters:Assuming automatic quality leads to missed errors and unreliable data pipelines.
Expert Zone
1
Source freshness checks depend on metadata in the source YAML and require scheduling in dbt Cloud or orchestration tools to be effective.
2
Using source() improves lineage tracking but requires consistent naming conventions and source definitions to avoid confusion in large projects.
3
Overriding sources per environment can introduce subtle bugs if not carefully managed, especially when schemas differ between dev and prod.
When NOT to use
Avoid using source() when working with ephemeral or temporary tables created only during a dbt run; use ref() instead. Also, if your raw data is not managed in a database but in files or APIs, source() is not applicable; use external tools or custom macros.
Production Patterns
In production, teams define all raw data sources in YAML with detailed metadata and tests. They use source() in models to ensure clear lineage and run automated freshness checks. Overriding sources per environment allows seamless deployment from development to production without changing SQL code.
Connections
Data Lineage
source() builds the foundation for tracking data lineage by linking raw tables to transformations.
Understanding source() helps grasp how data flows and dependencies are tracked in complex pipelines.
Software Dependency Injection
source() acts like dependency injection by decoupling raw data definitions from transformation logic.
Knowing this connection clarifies why separating configuration (YAML) from code (SQL) improves flexibility and maintainability.
Supply Chain Management
source() is like identifying raw material suppliers in a supply chain before manufacturing products.
Recognizing this analogy helps appreciate the importance of clear source definitions for quality control and traceability.
Common Pitfalls
#1Referencing raw tables directly in SQL without source()
Wrong approach:select * from raw_schema.users
Correct approach:select * from {{ source('raw_data', 'users') }}
Root cause:Not using source() loses metadata benefits and breaks dbt's dependency tracking.
#2Using source() without defining the source in YAML
Wrong approach:select * from {{ source('unknown_source', 'users') }}
Correct approach:Define the source in YAML before using source() in SQL.
Root cause:Missing source definitions cause compilation errors.
#3Hardcoding schema names inside source YAML without environment overrides
Wrong approach:schema: raw_schema_prod (fixed in YAML)
Correct approach:Use variables or profiles to override schema per environment.
Root cause:Fixed schemas reduce flexibility and cause deployment issues.
Key Takeaways
The source() function in dbt connects your models to raw tables by referencing them through YAML-defined sources.
Using source() improves clarity, documentation, testing, and lineage tracking in your data pipeline.
Source definitions must be declared in YAML files before using source() in SQL models.
Advanced source features like freshness checks and environment overrides help build robust production pipelines.
Avoid common mistakes like hardcoding raw table names or skipping source definitions to maintain project health.