
Staging, intermediate, and marts pattern in dbt - Deep Dive

Overview - Staging, intermediate, and marts pattern
What is it?
The staging, intermediate, and marts pattern is a way to organize data transformations in dbt projects. It breaks down the process into three layers: staging cleans and prepares raw data, intermediate applies business logic and combines data, and marts create final tables for analysis. This structure helps keep data workflows clear and manageable.
Why it matters
Without this pattern, data transformations can become messy and hard to maintain, leading to errors and slow analysis. Organizing work into layers makes it easier to find problems, reuse code, and deliver reliable data quickly. It helps teams work together smoothly and supports better decision-making with clean data.
Where it fits
Learners should first understand basic SQL and dbt concepts like models and dependencies. After mastering this pattern, they can explore advanced dbt features like snapshots, tests, and documentation to build robust data pipelines.
Mental Model
Core Idea
Breaking data transformations into clear layers—staging, intermediate, and marts—creates a clean, reusable, and understandable flow from raw data to final analysis tables.
Think of it like...
It's like cooking a meal: staging is washing and chopping ingredients, intermediate is cooking and mixing them into dishes, and marts are plating the food beautifully for guests to enjoy.
┌─────────────┐     ┌──────────────────┐     ┌──────────────┐
│   Staging   │────▶│   Intermediate   │────▶│    Marts     │
│ (clean raw) │     │ (business logic) │     │ (final data) │
└─────────────┘     └──────────────────┘     └──────────────┘
Build-Up - 7 Steps
1
Foundation: Understanding Raw Data Sources
Concept: Raw data is the starting point and often messy or inconsistent.
Raw data comes from places like databases or files. It may have missing values, duplicates, or inconsistent formats. Before using it, we need to clean and organize it.
Result
You recognize that raw data needs preparation before analysis.
Understanding raw data's imperfections explains why we need a staging layer to clean it first.
2
Foundation: Introduction to dbt Models
Concept: dbt models are SQL files that transform data step-by-step.
In dbt, each model is a SQL query saved as a file. Running dbt builds these models in order, creating tables or views. Models can depend on each other, forming a chain of transformations.
Result
You can create simple dbt models and understand dependencies.
Knowing models and dependencies is key to organizing transformations into layers.
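As a minimal sketch of this idea (model and column names such as stg_users are invented for illustration, not from any particular project), each model is just a SELECT in its own file, and one model builds on another through dbt's `ref()` function, which is how dbt learns the dependency order:

```sql
-- models/staging/stg_users.sql (illustrative name)
-- A model is just a SELECT; dbt turns it into a table or view.
select
    id as user_id,
    email
from {{ source('app', 'users') }}
```

```sql
-- models/marts/user_counts.sql (illustrative name)
-- ref() both points at stg_users and records the dependency,
-- so dbt always builds stg_users first.
select count(*) as total_users
from {{ ref('stg_users') }}
```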
3
Intermediate: Creating the Staging Layer
🤔 Before reading on: do you think staging models should contain complex business logic or just clean raw data? Commit to your answer.
Concept: The staging layer cleans and standardizes raw data without adding business rules.
Staging models select raw tables, fix data types, rename columns for clarity, and handle missing values. They do not combine data or apply business logic. This keeps raw data consistent and easy to use.
Result
You have clean, consistent tables ready for further processing.
Separating cleaning from logic prevents errors and makes debugging easier.
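A sketch of what such a staging model might look like (source, table, and column names are invented for illustration):

```sql
-- models/staging/stg_orders.sql (illustrative)
-- Cleaning only: rename, cast, standardize, fill gaps.
-- No joins, no business rules.
select
    order_id,
    customer_id,
    cast(order_ts as timestamp) as ordered_at,   -- fix the data type
    lower(trim(status)) as order_status,         -- standardize formatting
    coalesce(total_amount, 0) as total_amount    -- handle missing values
from {{ source('shop', 'raw_orders') }}
```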
4
Intermediate: Building the Intermediate Layer
🤔 Before reading on: do you think intermediate models combine data from multiple sources or just one? Commit to your answer.
Concept: Intermediate models apply business logic and combine staging tables.
This layer joins staging tables, filters data, calculates new fields, and applies rules specific to the business. It prepares data for final reporting but is not yet the final output.
Result
You create tables that reflect business concepts and are ready for analysis.
Isolating business logic here makes it reusable and easier to update.
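As a sketch of this layer (model and column names such as stg_customers and stg_orders are invented for illustration), an intermediate model joins staging tables and encodes a business rule:

```sql
-- models/intermediate/int_customer_orders.sql (illustrative)
-- Combines staging tables and applies a business definition
-- of a "completed" order. Still not the final reporting table.
select
    c.customer_id,
    c.customer_name,
    o.order_id,
    o.total_amount,
    o.order_status = 'completed' as is_completed  -- business logic lives here
from {{ ref('stg_customers') }} as c
join {{ ref('stg_orders') }} as o
    on o.customer_id = c.customer_id
```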
5
Intermediate: Designing the Marts Layer
Concept: Marts are the final tables tailored for reporting and analysis.
Marts select from intermediate models and shape data for specific use cases like sales reports or customer insights. They often aggregate data and optimize for fast queries.
Result
You produce clean, user-friendly tables ready for dashboards or analysts.
Having a dedicated layer for final outputs improves performance and clarity.
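A sketch of a mart built on such an intermediate model (all names are invented for illustration; `config()` here requests a table materialization for query speed):

```sql
-- models/marts/fct_customer_revenue.sql (illustrative)
{{ config(materialized='table') }}

-- Aggregates an intermediate model into an analysis-ready table.
select
    customer_id,
    customer_name,
    count(order_id) as order_count,
    sum(total_amount) as total_revenue
from {{ ref('int_customer_orders') }}
group by customer_id, customer_name
```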
6
Advanced: Managing Dependencies and Testing
🤔 Before reading on: do you think tests should be applied at staging, intermediate, or marts layers? Commit to your answer.
Concept: Testing and managing dependencies ensure data quality across layers.
dbt allows tests like uniqueness or null checks on any model. Applying tests early in staging catches errors quickly. Managing dependencies ensures models build in the right order and changes propagate safely.
Result
You maintain reliable data pipelines with fewer errors.
Early testing and clear dependencies prevent costly mistakes downstream.
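In dbt, such tests are declared in YAML alongside the models. A minimal sketch (model and column names are illustrative):

```yaml
# models/staging/schema.yml (illustrative)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:        # dbt's built-in generic tests
          - unique
          - not_null
      - name: order_status
        tests:
          - accepted_values:
              values: ['completed', 'pending', 'cancelled']
```

Because these run against the staging layer, bad source data is caught before any business logic consumes it.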
7
Expert: Optimizing for Performance and Maintenance
🤔 Before reading on: do you think layering always improves performance or can it sometimes add overhead? Commit to your answer.
Concept: Layering helps maintainability but requires balancing performance and complexity.
Too many layers can slow down builds and queries. Experts optimize by materializing models appropriately (tables, views, incremental), pruning unnecessary layers, and documenting dependencies. They also modularize code for reuse.
Result
You build efficient, maintainable dbt projects that scale well.
Knowing when to simplify or optimize layers is key to professional data engineering.
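One common way to express these choices is per-folder materialization config in dbt_project.yml. A sketch, assuming a project named my_project (the name and the exact split are illustrative, not prescriptive):

```yaml
# dbt_project.yml excerpt (illustrative)
models:
  my_project:
    staging:
      +materialized: view        # cheap to rebuild, always reflects sources
    intermediate:
      +materialized: ephemeral   # inlined into downstream queries, no object built
    marts:
      +materialized: table       # precomputed for fast analyst queries
```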
Under the Hood
dbt compiles SQL models into executable queries, respecting dependencies to build tables or views in order. The staging layer runs first, cleaning raw data into consistent tables. Intermediate models then run, applying business logic and joining staging tables. Finally, marts run to produce analysis-ready tables. dbt manages this flow automatically, tracking changes and rebuilding affected models.
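Concretely, that compilation step mostly means resolving `ref()` and `source()` calls into real relation names. As a sketch (names invented; the database and schema prefix depend on your connection profile and configs), a model written as `select user_id from {{ ref('stg_users') }}` might compile to roughly:

```sql
-- compiled output (approximate; actual database/schema names vary by project)
select user_id
from analytics.staging.stg_users
```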
Why designed this way?
This layered design was created to separate concerns: cleaning, logic, and presentation. It makes projects easier to understand, test, and maintain. Alternatives like monolithic SQL scripts were hard to debug and reuse. The pattern supports collaboration and incremental development.
┌─────────────┐
│   Raw Data  │
└─────┬───────┘
      │
┌─────▼───────┐
│  Staging    │
│ (cleaning)  │
└─────┬───────┘
      │
┌─────▼────────────┐
│   Intermediate   │
│ (business logic) │
└─────┬────────────┘
      │
┌─────▼───────┐
│   Marts     │
│ (final data)│
└─────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Is it okay to put business logic directly in staging models? Commit yes or no.
Common Belief: Many think staging models should include business logic to save time.
Reality: Staging models should only clean and standardize raw data; business logic belongs in intermediate models.
Why it matters: Mixing logic in staging makes debugging harder and reduces reusability of clean data.
Quick: Do you think marts should always be simple views? Commit yes or no.
Common Belief: Some believe marts should always be views to avoid storage costs.
Reality: Marts are often materialized as tables for performance, especially with large data or complex queries.
Why it matters: Using only views can cause slow queries and poor user experience.
Quick: Does layering always improve performance? Commit yes or no.
Common Belief: Many assume layering automatically makes data pipelines faster.
Reality: Layering improves maintainability but can add overhead if not optimized properly.
Why it matters: Ignoring performance tradeoffs can lead to slow builds and queries.
Quick: Is it safe to skip testing in staging because data is raw? Commit yes or no.
Common Belief: Some think testing raw data is unnecessary since it comes from trusted sources.
Reality: Raw data often has errors; testing in staging catches issues early.
Why it matters: Skipping tests leads to bad data propagating downstream, causing wrong decisions.
Expert Zone
1
Staging models often mirror source tables but rename columns to a consistent naming convention, which is crucial for downstream clarity.
2
Intermediate models can be reused across multiple marts, enabling modular and DRY (Don't Repeat Yourself) transformations.
3
Materialization strategies (table, view, incremental) must be chosen carefully per layer to balance build time and query speed.
When NOT to use
This pattern is less suitable for very small projects where layering adds unnecessary complexity. In such cases, simpler flat models or direct transformations may be better. Also, real-time streaming data pipelines often require different architectures.
Production Patterns
In production, teams use this pattern combined with automated testing, documentation, and CI/CD pipelines. They version control dbt projects, use incremental models for large datasets, and separate environments for development and production to ensure data quality and reliability.
Connections
Software Engineering Layered Architecture
This data layering pattern mirrors software design layers like presentation, business logic, and data access.
Understanding software layering helps grasp why separating data cleaning, logic, and presentation improves maintainability and collaboration.
ETL Pipelines
The pattern is a modern, modular approach to traditional Extract-Transform-Load processes.
Knowing ETL basics clarifies how dbt layers replace monolithic transformations with reusable, testable steps.
Cooking Process
Like cooking stages (prep, cook, plate), data transformations progress through cleaning, logic, and final presentation.
This analogy helps understand the importance of order and separation in complex workflows.
Common Pitfalls
#1 Putting business logic in staging models.
Wrong approach (staging model with business logic baked in):
select id, case when status = 'active' then 1 else 0 end as is_active from raw.users
Correct approach (staging only cleans and renames columns):
select id, status from raw.users
Then apply the business logic in an intermediate model:
select id, case when status = 'active' then 1 else 0 end as is_active from staging.users
Root cause: Confusing cleaning with business logic mixes concerns and makes maintenance harder.
#2 Materializing all models as views, causing slow queries.
Wrong approach (dbt_project.yml):
models:
  marts:
    materialized: view
Correct approach:
models:
  marts:
    materialized: table
Root cause: Not considering query performance and data size leads to inefficient data access.
#3 Skipping tests on staging models.
Wrong approach: No tests defined on staging models.
Correct approach (in a schema.yml covering the staging models):
models:
  - name: stg_users   # example staging model
    columns:
      - name: id
        tests:
          - unique
          - not_null
Root cause: Assuming raw data is perfect causes errors to propagate unnoticed.
Key Takeaways
Organizing data transformations into staging, intermediate, and marts layers creates clear separation of concerns.
Staging cleans raw data without applying business rules, ensuring consistent inputs for later steps.
Intermediate models apply business logic and combine data, making transformations reusable and testable.
Marts produce final tables optimized for analysis and reporting, improving user experience.
Balancing layering with performance and testing at each stage is essential for reliable, maintainable data pipelines.