Overview - Multi-source fan-in patterns

What is it?

Multi-source fan-in patterns in dbt describe how data from multiple sources is combined into a single, unified dataset. This pattern helps gather information from different tables or databases and merge them for analysis. It simplifies working with scattered data by bringing it together in one place. This is useful when you want to analyze or report on data that lives in different systems.

Why it matters

Without multi-source fan-in patterns, analysts and data engineers would struggle to combine data from various places, leading to duplicated work and inconsistent results. This pattern ensures data is integrated cleanly and efficiently, saving time and reducing errors. It helps businesses get a complete picture by connecting all relevant data points, which is crucial for making informed decisions.

Where it fits

Before learning multi-source fan-in patterns, you should understand basic dbt concepts like models, sources, and ref functions. After mastering this pattern, you can explore advanced data modeling techniques, incremental models, and orchestration strategies to build scalable data pipelines.

Mental Model

Core Idea

Multi-source fan-in patterns gather and merge data from many sources into one clear, combined dataset for easier analysis.

Think of it like...

Imagine several rivers flowing into one big lake. Each river brings water from a different place, and the lake collects all that water in one spot, making it easy to see the total amount.

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ Source A    │   │ Source B    │   │ Source C    │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │       
       └───────┬─────────┴─────────┬───────┘       
               │                   │               
         ┌─────▼───────────────────▼─────┐         
         │       Combined Dataset          │         
         └────────────────────────────────┘

Build-Up - 6 Steps

1

FoundationUnderstanding dbt sources and models

Concept: Learn what sources and models are in dbt and how they represent data.

In dbt, a source is the original data table from your database. A model is a SQL file that transforms data, often using sources or other models. You define sources in your dbt project to tell dbt where raw data lives. Models use the ref() function to refer to other models and source() to refer to sources.

Result

You can write SQL that pulls data from raw tables (sources) and create new tables (models) with transformed data.

Understanding sources and models is key because multi-source fan-in patterns rely on combining multiple sources through models.

2

FoundationBasic SQL joins to combine data

3

IntermediateCreating fan-in models in dbt

4

IntermediateHandling schema differences across sources

5

AdvancedOptimizing fan-in models for performance

6

ExpertManaging fan-in complexity with modular design

Under the Hood

dbt compiles your SQL models into raw SQL queries that run on your database. When you create a fan-in model, dbt resolves all source() and ref() calls to actual table names, then runs the combined SQL with joins to merge data. The database engine executes these joins efficiently using indexes and query plans. dbt manages dependencies so models build in the right order, ensuring data freshness.

Why designed this way?

dbt was designed to separate transformation logic from raw data, making pipelines modular and testable. Multi-source fan-in patterns reflect real-world needs to combine data from diverse systems. Using SQL joins leverages the power of the database engine, avoiding data movement. The modular design supports collaboration and incremental builds, improving speed and reliability.

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│ source() A  │      │ source() B  │      │ source() C  │
└──────┬──────┘      └──────┬──────┘      └──────┬──────┘
       │                     │                     │       
       │                     │                     │       
       ▼                     ▼                     ▼       
┌───────────────────────────────────────────────┐       
│ dbt compiles SQL with JOINs combining sources  │       
└───────────────────────────────────────────────┘       
                       │                                   
                       ▼                                   
              ┌─────────────────┐                         
              │ Database engine │                         
              └─────────────────┘                         
                       │                                   
                       ▼                                   
              ┌─────────────────┐                         
              │ Combined table  │                         
              └─────────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Do you think multi-source fan-in always means UNION ALL? Commit yes or no.

Common Belief:Many think fan-in means stacking data vertically using UNION ALL.

Tap to reveal reality

Quick: Do you think all sources must have identical schemas to fan-in? Commit yes or no.

Common Belief:People often believe sources must have the exact same columns and types to combine.

Tap to reveal reality

Quick: Do you think fan-in models always slow down your dbt runs? Commit yes or no.

Common Belief:Some believe fan-in models inherently cause slow performance.

Tap to reveal reality

Quick: Do you think a single fan-in model should combine all sources at once? Commit yes or no.

Common Belief:Many think one big model is best for combining all sources.

Tap to reveal reality

Expert Zone

1

Fan-in models often require careful handling of slowly changing dimensions to avoid data duplication or loss.

2

The order of joins in fan-in models can impact query performance significantly, especially with large datasets.

3

Using dbt's ephemeral models for intermediate fan-in steps can reduce storage costs but may increase query complexity.

When NOT to use

Avoid multi-source fan-in patterns when sources are extremely large and unjoinable due to lack of keys; instead, consider data lake approaches or pre-aggregated datasets. Also, if sources update at very different frequencies, asynchronous pipelines might be better.

Production Patterns

In production, fan-in models are often split into layers: staging models clean raw sources, intermediate models combine related sources, and final fan-in models produce business-ready tables. Teams use dbt's testing and documentation features to ensure data quality across these layers.

Connections

Data Warehousing

Multi-source fan-in patterns build on the principle of integrating data from multiple systems into a central warehouse.

Understanding fan-in helps grasp how data warehouses unify diverse data for analytics.

ETL Pipelines

Fan-in is a common pattern in ETL where Extracted data from various sources is Transformed and Loaded into one place.

Knowing fan-in clarifies how ETL pipelines merge data before analysis.

Supply Chain Management

Like fan-in combines data streams, supply chains merge inputs from multiple suppliers into one product flow.

Recognizing this similarity helps appreciate the complexity and coordination needed in data integration.

Common Pitfalls

#1Joining sources without aligning schemas causes errors.

Wrong approach:SELECT * FROM source_a JOIN source_b ON source_a.id = source_b.id;

Correct approach:SELECT source_a.id, source_a.name AS customer_name, source_b.name AS product_name FROM source_a JOIN source_b ON source_a.id = source_b.id;

Root cause:Assuming columns match exactly without renaming or selecting specific fields.

#2Combining all sources in one huge query makes debugging hard.

Wrong approach:CREATE MODEL fan_in AS SELECT * FROM source_a JOIN source_b JOIN source_c JOIN source_d ...;

Correct approach:CREATE MODEL intermediate_1 AS SELECT ... FROM source_a JOIN source_b; CREATE MODEL intermediate_2 AS SELECT ... FROM source_c JOIN source_d; CREATE MODEL fan_in AS SELECT ... FROM intermediate_1 JOIN intermediate_2;

Root cause:Not breaking down complex joins into manageable steps.

#3Not filtering data early leads to slow queries.

Wrong approach:SELECT * FROM source_a JOIN source_b;

Correct approach:SELECT * FROM (SELECT * FROM source_a WHERE date > '2024-01-01') a JOIN (SELECT * FROM source_b WHERE active = true) b;

Root cause:Ignoring data volume and filtering before joining.

Key Takeaways

Multi-source fan-in patterns combine data from different places into one dataset using SQL joins.

Handling schema differences and filtering data early are crucial for clean and efficient fan-in models.

Breaking complex fan-in logic into smaller models improves maintainability and performance.

dbt manages dependencies and compiles SQL so fan-in models run smoothly on your database.

Understanding fan-in patterns is essential for building scalable, reliable data pipelines that integrate diverse data.