0
0
dbtdata~15 mins

Multi-source fan-in patterns in dbt - Deep Dive

Choose your learning style9 modes available
Overview - Multi-source fan-in patterns
What is it?
Multi-source fan-in patterns in dbt describe how data from multiple sources is combined into a single, unified dataset. This pattern helps gather information from different tables or databases and merge them for analysis. It simplifies working with scattered data by bringing it together in one place. This is useful when you want to analyze or report on data that lives in different systems.
Why it matters
Without multi-source fan-in patterns, analysts and data engineers would struggle to combine data from various places, leading to duplicated work and inconsistent results. This pattern ensures data is integrated cleanly and efficiently, saving time and reducing errors. It helps businesses get a complete picture by connecting all relevant data points, which is crucial for making informed decisions.
Where it fits
Before learning multi-source fan-in patterns, you should understand basic dbt concepts like models, sources, and ref functions. After mastering this pattern, you can explore advanced data modeling techniques, incremental models, and orchestration strategies to build scalable data pipelines.
Mental Model
Core Idea
Multi-source fan-in patterns gather and merge data from many sources into one clear, combined dataset for easier analysis.
Think of it like...
Imagine several rivers flowing into one big lake. Each river brings water from a different place, and the lake collects all that water in one spot, making it easy to see the total amount.
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ Source A    │   │ Source B    │   │ Source C    │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │       
       └───────┬─────────┴─────────┬───────┘       
               │                   │               
         ┌─────▼───────────────────▼─────┐         
         │       Combined Dataset          │         
         └────────────────────────────────┘         
Build-Up - 6 Steps
1
FoundationUnderstanding dbt sources and models
🤔
Concept: Learn what sources and models are in dbt and how they represent data.
In dbt, a source is the original data table from your database. A model is a SQL file that transforms data, often using sources or other models. You define sources in your dbt project to tell dbt where raw data lives. Models use the ref() function to refer to other models and source() to refer to sources.
Result
You can write SQL that pulls data from raw tables (sources) and create new tables (models) with transformed data.
Understanding sources and models is key because multi-source fan-in patterns rely on combining multiple sources through models.
2
FoundationBasic SQL joins to combine data
🤔
Concept: Learn how to use SQL joins to merge data from different tables.
SQL joins let you combine rows from two or more tables based on related columns. The common types are INNER JOIN (only matching rows), LEFT JOIN (all rows from left table plus matches), and FULL JOIN (all rows from both tables). Joins are the foundation for merging data from multiple sources.
Result
You can write queries that bring together data from different tables into one result set.
Knowing SQL joins is essential because multi-source fan-in patterns use joins to merge data from multiple sources.
3
IntermediateCreating fan-in models in dbt
🤔Before reading on: do you think a fan-in model uses UNION or JOIN to combine sources? Commit to your answer.
Concept: Learn how to build a dbt model that combines multiple sources using SQL joins.
A fan-in model in dbt is a model that pulls data from multiple sources or models and merges them into one table. Typically, you write a SQL file that uses JOINs to combine these sources on common keys. You use source() to reference raw tables and ref() for other models. The model creates a single table that contains all relevant data.
Result
The fan-in model produces a unified table combining data from all specified sources.
Understanding how to write fan-in models helps you centralize data, making downstream analysis simpler and more consistent.
4
IntermediateHandling schema differences across sources
🤔Before reading on: do you think all sources must have identical columns to fan-in? Commit to yes or no.
Concept: Learn how to manage differences in column names and data types when combining sources.
Often, sources have different column names or formats. You can use SQL aliases to rename columns and CAST to convert data types so they match. You may also select only the needed columns from each source. This harmonization is crucial before joining data to avoid errors and mismatches.
Result
You get a clean, consistent dataset where columns align properly across sources.
Knowing how to handle schema differences prevents bugs and ensures your combined data is accurate and usable.
5
AdvancedOptimizing fan-in models for performance
🤔Before reading on: do you think joining many large sources always slows down queries? Commit to yes or no.
Concept: Learn techniques to improve query speed and resource use in fan-in models.
Joining many large tables can be slow. To optimize, filter data early using WHERE clauses, select only needed columns, and use incremental models to process only new data. Also, consider materializing fan-in models as tables instead of views to speed up repeated queries. dbt's built-in caching and dependency graph help manage this efficiently.
Result
Fan-in models run faster and use fewer resources, making your data pipeline more scalable.
Understanding optimization techniques helps you build fan-in models that work well even with big data.
6
ExpertManaging fan-in complexity with modular design
🤔Before reading on: do you think a single fan-in model should combine all sources at once or be broken into smaller steps? Commit to your answer.
Concept: Learn how to break complex fan-in patterns into smaller, manageable models for clarity and maintainability.
When many sources are involved, combining all at once can create complex, hard-to-maintain SQL. Instead, create intermediate models that combine subsets of sources, then build a final fan-in model that merges these intermediates. This modular approach improves readability, testing, and debugging. It also leverages dbt's dependency graph for efficient builds.
Result
Your project is easier to understand, maintain, and extend as data sources grow.
Knowing how to modularize fan-in patterns prevents technical debt and supports team collaboration.
Under the Hood
dbt compiles your SQL models into raw SQL queries that run on your database. When you create a fan-in model, dbt resolves all source() and ref() calls to actual table names, then runs the combined SQL with joins to merge data. The database engine executes these joins efficiently using indexes and query plans. dbt manages dependencies so models build in the right order, ensuring data freshness.
Why designed this way?
dbt was designed to separate transformation logic from raw data, making pipelines modular and testable. Multi-source fan-in patterns reflect real-world needs to combine data from diverse systems. Using SQL joins leverages the power of the database engine, avoiding data movement. The modular design supports collaboration and incremental builds, improving speed and reliability.
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│ source() A  │      │ source() B  │      │ source() C  │
└──────┬──────┘      └──────┬──────┘      └──────┬──────┘
       │                     │                     │       
       │                     │                     │       
       ▼                     ▼                     ▼       
┌───────────────────────────────────────────────┐       
│ dbt compiles SQL with JOINs combining sources  │       
└───────────────────────────────────────────────┘       
                       │                                   
                       ▼                                   
              ┌─────────────────┐                         
              │ Database engine │                         
              └─────────────────┘                         
                       │                                   
                       ▼                                   
              ┌─────────────────┐                         
              │ Combined table  │                         
              └─────────────────┘                         
Myth Busters - 4 Common Misconceptions
Quick: Do you think multi-source fan-in always means UNION ALL? Commit yes or no.
Common Belief:Many think fan-in means stacking data vertically using UNION ALL.
Tap to reveal reality
Reality:Fan-in usually means joining data horizontally from multiple sources, not just stacking rows.
Why it matters:Confusing fan-in with UNION leads to wrong data shapes and incorrect analysis.
Quick: Do you think all sources must have identical schemas to fan-in? Commit yes or no.
Common Belief:People often believe sources must have the exact same columns and types to combine.
Tap to reveal reality
Reality:Sources can have different schemas; you can rename and cast columns to align them before joining.
Why it matters:Assuming identical schemas limits your ability to integrate diverse data and causes errors.
Quick: Do you think fan-in models always slow down your dbt runs? Commit yes or no.
Common Belief:Some believe fan-in models inherently cause slow performance.
Tap to reveal reality
Reality:With proper filtering, incremental builds, and materializations, fan-in models can be efficient.
Why it matters:Believing fan-in is slow may discourage using it, missing out on clean data integration.
Quick: Do you think a single fan-in model should combine all sources at once? Commit yes or no.
Common Belief:Many think one big model is best for combining all sources.
Tap to reveal reality
Reality:Breaking fan-in into smaller models improves maintainability and debugging.
Why it matters:Ignoring modular design leads to complex, fragile pipelines that are hard to fix.
Expert Zone
1
Fan-in models often require careful handling of slowly changing dimensions to avoid data duplication or loss.
2
The order of joins in fan-in models can impact query performance significantly, especially with large datasets.
3
Using dbt's ephemeral models for intermediate fan-in steps can reduce storage costs but may increase query complexity.
When NOT to use
Avoid multi-source fan-in patterns when sources are extremely large and unjoinable due to lack of keys; instead, consider data lake approaches or pre-aggregated datasets. Also, if sources update at very different frequencies, asynchronous pipelines might be better.
Production Patterns
In production, fan-in models are often split into layers: staging models clean raw sources, intermediate models combine related sources, and final fan-in models produce business-ready tables. Teams use dbt's testing and documentation features to ensure data quality across these layers.
Connections
Data Warehousing
Multi-source fan-in patterns build on the principle of integrating data from multiple systems into a central warehouse.
Understanding fan-in helps grasp how data warehouses unify diverse data for analytics.
ETL Pipelines
Fan-in is a common pattern in ETL where Extracted data from various sources is Transformed and Loaded into one place.
Knowing fan-in clarifies how ETL pipelines merge data before analysis.
Supply Chain Management
Like fan-in combines data streams, supply chains merge inputs from multiple suppliers into one product flow.
Recognizing this similarity helps appreciate the complexity and coordination needed in data integration.
Common Pitfalls
#1Joining sources without aligning schemas causes errors.
Wrong approach:SELECT * FROM source_a JOIN source_b ON source_a.id = source_b.id;
Correct approach:SELECT source_a.id, source_a.name AS customer_name, source_b.name AS product_name FROM source_a JOIN source_b ON source_a.id = source_b.id;
Root cause:Assuming columns match exactly without renaming or selecting specific fields.
#2Combining all sources in one huge query makes debugging hard.
Wrong approach:CREATE MODEL fan_in AS SELECT * FROM source_a JOIN source_b JOIN source_c JOIN source_d ...;
Correct approach:CREATE MODEL intermediate_1 AS SELECT ... FROM source_a JOIN source_b; CREATE MODEL intermediate_2 AS SELECT ... FROM source_c JOIN source_d; CREATE MODEL fan_in AS SELECT ... FROM intermediate_1 JOIN intermediate_2;
Root cause:Not breaking down complex joins into manageable steps.
#3Not filtering data early leads to slow queries.
Wrong approach:SELECT * FROM source_a JOIN source_b;
Correct approach:SELECT * FROM (SELECT * FROM source_a WHERE date > '2024-01-01') a JOIN (SELECT * FROM source_b WHERE active = true) b;
Root cause:Ignoring data volume and filtering before joining.
Key Takeaways
Multi-source fan-in patterns combine data from different places into one dataset using SQL joins.
Handling schema differences and filtering data early are crucial for clean and efficient fan-in models.
Breaking complex fan-in logic into smaller models improves maintainability and performance.
dbt manages dependencies and compiles SQL so fan-in models run smoothly on your database.
Understanding fan-in patterns is essential for building scalable, reliable data pipelines that integrate diverse data.