
Incremental strategies (append, merge, delete+insert) in dbt - Deep Dive

Overview - Incremental strategies (append, merge, delete+insert)
What is it?
Incremental strategies in dbt are ways to update data models efficiently by only processing new or changed data instead of rebuilding everything from scratch. The three main strategies are append, merge, and delete+insert. Append adds new rows, merge updates existing rows and adds new ones, and delete+insert replaces data by deleting old rows and inserting fresh ones. These methods help save time and computing resources when working with large datasets.
Why it matters
Without incremental strategies, every data update would require rebuilding entire tables, which can be slow and costly. Incremental updates let data teams refresh only what changed, making pipelines faster and more efficient. This means quicker insights and less strain on data warehouses, which is crucial for businesses relying on timely data decisions.
Where it fits
Learners should first understand basic SQL and dbt models before tackling incremental strategies. After mastering these, they can explore advanced data engineering topics like data freshness, partitioning, and orchestration. Incremental strategies are a key step toward building scalable, maintainable data pipelines.
Mental Model
Core Idea
Incremental strategies update data by processing only new or changed parts instead of rebuilding everything.
Think of it like...
It's like updating a photo album by adding new pictures, fixing some existing ones, or replacing old pages, instead of printing a whole new album every time.
┌───────────────┐
│ Full Refresh  │
│ (Rebuild all) │
└──────┬────────┘
       │
       ▼
┌─────────────────────────────┐
│ Incremental Strategies      │
│ ┌───────────┐ ┌───────────┐ │
│ │ Append    │ │ Merge     │ │
│ │ (Add new) │ │ (Add+Upd) │ │
│ └───────────┘ └───────────┘ │
│        ┌───────────────┐    │
│        │ Delete+Insert │    │
│        │ (Replace data)│    │
│        └───────────────┘    │
└─────────────────────────────┘
Build-Up - 7 Steps
1
Foundation: What is Incremental Loading
Concept: Introduce the basic idea of incremental loading as updating data by adding only new information.
Imagine you have a list of sales every day. Instead of rewriting the whole list every time, you just add the new day's sales to the end. This is incremental loading — adding new data without touching old data.
Result
You save time and resources by not reprocessing old data.
Understanding incremental loading helps you see why full rebuilds are often unnecessary and inefficient.
2
Foundation: dbt Incremental Model Basics
Concept: Learn how dbt supports incremental models to update tables efficiently.
In dbt, you can write models that process only new or changed data by using the is_incremental() macro. During an incremental run, dbt applies your filter and appends or updates data instead of rebuilding the whole table.
Result
dbt runs faster and uses less warehouse resources during incremental runs.
Knowing how dbt detects incremental runs is key to writing efficient models.
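As a minimal sketch, a dbt incremental model can look like this. The model and column names (raw_events, event_time) are illustrative, not from the source:

```sql
-- models/events.sql: a minimal dbt incremental model (illustrative names)
{{ config(materialized='incremental') }}

select id, event_time, payload
from {{ ref('raw_events') }}

{% if is_incremental() %}
  -- Only on incremental runs: pull rows newer than what the target
  -- table ({{ this }}) already contains.
  where event_time > (select max(event_time) from {{ this }})
{% endif %}
```

On the first run (or with --full-refresh) the {% if %} block is skipped and the whole table is built; on later runs only the filtered rows are processed.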
3
Intermediate: Append Strategy Explained
🤔 Before reading on: do you think append can update existing rows, or only add new ones? Commit to your answer.
Concept: Append adds only new rows to the existing table without changing old rows.
The append strategy assumes rows never change once added. For example, daily logs are simply appended as new rows: the model filters for new data and inserts it into the existing table.
Result
The table grows by adding new rows, old data stays untouched.
Understanding append clarifies when it's safe to just add data without checking for updates.
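A sketch of an append-only model in dbt (names like raw_logs and logged_at are illustrative):

```sql
-- models/daily_logs.sql: append-only incremental model (illustrative names)
{{ config(
    materialized='incremental',
    incremental_strategy='append'
) }}

select log_id, logged_at, message
from {{ ref('raw_logs') }}

{% if is_incremental() %}
  -- New rows are simply inserted; existing rows are never touched.
  where logged_at > (select max(logged_at) from {{ this }})
{% endif %}
```

Note there is no unique_key here: append never needs one, because it never matches against existing rows.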
4
Intermediate: Merge Strategy Explained
🤔 Before reading on: do you think merge can handle both new and updated rows? Commit to your answer.
Concept: Merge updates existing rows if they changed and adds new rows if they are new.
Merge uses a unique key to find matching rows. If a row exists, it updates it; if not, it inserts it. This is useful when data can change, like customer info or product prices.
Result
The table stays up-to-date with both new and changed data.
Knowing merge helps you handle data that changes over time, not just grows.
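A sketch of a merge-strategy model; the unique_key tells dbt which rows count as "the same" row (customer_id, updated_at, and the table names are illustrative):

```sql
-- models/customers.sql: merge strategy keyed on customer_id (illustrative)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='customer_id'
) }}

select customer_id, name, email, updated_at
from {{ ref('raw_customers') }}

{% if is_incremental() %}
  -- Pull only rows changed since the last run; dbt merges them into the
  -- target on customer_id (update if matched, insert if not).
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```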
5
Intermediate: Delete+Insert Strategy Explained
🤔 Before reading on: do you think delete+insert removes all data, or just some? Commit to your answer.
Concept: Delete+insert removes specific old data and replaces it with fresh data.
This strategy deletes rows matching a condition (like a date range) and inserts new rows for that range. It's useful when data corrections or reprocessing are needed.
Result
The affected data is fully refreshed without rebuilding the entire table.
Understanding delete+insert shows how to handle data corrections efficiently.
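A sketch of a delete+insert model that refreshes a recent window of dates. Names are illustrative, and dateadd() is warehouse-specific syntax (Snowflake/Redshift style), so adapt it to your warehouse:

```sql
-- models/daily_sales.sql: delete+insert keyed on sale_date (illustrative)
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='sale_date'
) }}

select sale_date, store_id, total_amount
from {{ ref('raw_sales') }}

{% if is_incremental() %}
  -- dbt deletes target rows whose sale_date appears in this result set,
  -- then inserts the fresh rows, fully refreshing the last 3 days.
  where sale_date >= dateadd(day, -3, current_date)
{% endif %}
```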
6
Advanced: Choosing the Right Incremental Strategy
🤔 Before reading on: which strategy would you pick for data that never changes? Commit to your answer.
Concept: Different data scenarios require different incremental strategies for best performance and accuracy.
For append-only data, use append. For data that changes, use merge. For data needing full refresh in parts, use delete+insert. Consider data volume, update frequency, and warehouse capabilities.
Result
You select the most efficient and correct strategy for your data pipeline.
Knowing how to match strategy to data type prevents errors and optimizes performance.
7
Expert: Handling Complex Incremental Edge Cases
🤔 Before reading on: do you think incremental models always guarantee perfect data consistency? Commit to your answer.
Concept: Incremental models can face challenges like late-arriving data, duplicates, or partial updates requiring careful handling.
Sometimes data arrives late or updates overlap. You may need to combine strategies or add logic to detect and fix duplicates. Also, warehouse-specific SQL features affect how merges and deletes work.
Result
You build robust incremental models that handle real-world data quirks.
Understanding edge cases prepares you to avoid subtle bugs and maintain data quality in production.
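One common pattern for late-arriving data is a lookback window: instead of filtering only past max(timestamp), reprocess the last few days so late updates inside that window are still merged. A sketch, with illustrative names and warehouse-specific dateadd() syntax:

```sql
-- models/orders.sql: merge with a 3-day lookback window (illustrative)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'
) }}

select order_id, order_ts, status
from {{ ref('raw_orders') }}

{% if is_incremental() %}
  -- Reprocess everything from 3 days before the latest loaded timestamp,
  -- not just rows strictly newer than it, so late-arriving updates within
  -- the window are caught; merge on order_id keeps this idempotent.
  where order_ts > (select dateadd(day, -3, max(order_ts)) from {{ this }})
{% endif %}
```

The window size is a trade-off: wider windows catch later data but reprocess more rows each run.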
Under the Hood
dbt incremental models run SQL that checks if the table exists and if the run is incremental. For append, it inserts filtered new rows. For merge, it uses SQL MERGE or equivalent commands to update and insert rows based on keys. For delete+insert, it deletes rows matching criteria then inserts fresh data. The warehouse executes these commands efficiently using indexes and partitions.
Why designed this way?
Incremental strategies were designed to reduce resource use and speed up data pipelines by avoiding full table rebuilds. Early data tools rebuilt everything, which was slow and costly. Incremental methods balance freshness and efficiency, adapting to different data change patterns.
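For a concrete sense of the SQL involved, a merge-strategy run compiles to warehouse DML roughly like the following. This is a sketch: the table, temp-table, and column names are illustrative, and the exact statement varies by adapter:

```sql
-- Roughly what dbt compiles a merge-strategy incremental run into
-- (illustrative names; actual DML varies by warehouse adapter)
merge into analytics.customers as target
using customers__dbt_tmp as source          -- new/changed rows from the model
  on target.customer_id = source.customer_id
when matched then update set                -- existing key: update in place
  name = source.name,
  email = source.email,
  updated_at = source.updated_at
when not matched then insert                -- new key: insert the row
  (customer_id, name, email, updated_at)
  values (source.customer_id, source.name, source.email, source.updated_at);
```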
┌───────────────────────────────┐
│ dbt Incremental Model Process │
├───────────────────────────────┤
│ Check if table exists and if  │
│ the run is incremental        │
├───────────────────────────────┤
│ If full refresh:              │
│   Rebuild entire table        │
│ Else if incremental:          │
│   ┌─────────────────┐         │
│   │ Append          │         │
│   │ Insert new rows │         │
│   ├─────────────────┤         │
│   │ Merge           │         │
│   │ Update + Insert │         │
│   ├─────────────────┤         │
│   │ Delete+Insert   │         │
│   │ Delete old +    │         │
│   │ insert new      │         │
│   └─────────────────┘         │
└───────────────────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does append strategy update existing rows or only add new ones? Commit to yes or no.
Common Belief: Append strategy can update existing rows if data changes.
Reality: Append only adds new rows and never updates existing ones.
Why it matters: Using append when data changes leads to stale or duplicate data, causing inaccurate reports.
Quick: Does merge strategy always guarantee no duplicates? Commit to yes or no.
Common Belief: Merge strategy automatically prevents all duplicates.
Reality: Merge relies on correct unique keys; if keys are wrong or missing, duplicates can still occur.
Why it matters: Assuming merge fixes duplicates without proper keys can cause data integrity issues.
Quick: Is delete+insert the same as a full table rebuild? Commit to yes or no.
Common Belief: Delete+insert is just a full rebuild under a different name.
Reality: Delete+insert only replaces specific partitions or subsets, not the entire table.
Why it matters: Misunderstanding this can lead to inefficient full rebuilds when partial refreshes would suffice.
Quick: Do incremental models always run faster than full refreshes? Commit to yes or no.
Common Belief: Incremental models are always faster than full refreshes.
Reality: If incremental logic is complex or data changes widely, incremental runs can be slower than, or as slow as, full refreshes.
Why it matters: Blindly trusting incremental speed can cause unexpected pipeline delays and resource use.
Expert Zone
1
Incremental strategies depend heavily on the uniqueness and stability of keys; changing keys can break merges.
2
Warehouse-specific SQL features like MERGE syntax or partition pruning affect how incremental strategies perform and must be understood deeply.
3
Combining incremental strategies with snapshotting or change data capture can create powerful, reliable pipelines but requires careful orchestration.
When NOT to use
Incremental strategies are not suitable when data changes affect large portions of the table or when data lineage requires full rebuilds for accuracy. In such cases, full refresh or snapshot models are better alternatives.
Production Patterns
In production, teams often use append for log data, merge for slowly changing dimensions, and delete+insert for partitioned fact tables with late-arriving data. They also automate incremental runs with freshness checks and alerting to catch failures early.
Connections
Change Data Capture (CDC)
Incremental strategies build on CDC by applying changes efficiently to data models.
Understanding CDC helps grasp how incremental updates track and apply only data changes, improving pipeline efficiency.
Database Indexing
Incremental merges rely on indexes to quickly find and update rows.
Knowing indexing principles explains why merge operations can be fast or slow depending on key design.
Version Control Systems (e.g., Git)
Incremental updates in data are similar to commits in version control, where only changes are saved.
Seeing incremental data updates like code commits clarifies the value of tracking and applying only differences.
Common Pitfalls
#1 Using the append strategy on data that changes over time.
Wrong approach: if is_incremental(): select * from source where date > (select max(date) from target)
Correct approach: merge into target using source on key when matched then update when not matched then insert
Root cause: Misunderstanding that append only adds new rows and cannot update existing data.
#2 Not defining a true unique key for the merge strategy.
Wrong approach: merge into target using source on source.date = target.date when matched then update set ...
Correct approach: merge into target using source on source.primary_key = target.primary_key when matched then update set ...
Root cause: Assuming any column can serve as a unique key without ensuring uniqueness.
#3 Deleting the entire table instead of a partition for delete+insert.
Wrong approach: delete from target; insert into target select * from source where date = '2024-01-01'
Correct approach: delete from target where date = '2024-01-01'; insert into target select * from source where date = '2024-01-01'
Root cause: Confusing a full table refresh with a partial partition refresh.
Key Takeaways
Incremental strategies update data efficiently by processing only new or changed data, saving time and resources.
Append adds new rows only, merge updates existing rows and adds new ones, and delete+insert replaces specific data subsets.
Choosing the right strategy depends on data characteristics like mutability and volume.
Understanding warehouse SQL features and data keys is crucial for reliable incremental models.
Incremental updates are powerful but require careful design to avoid data quality issues.