dbt · data · ~15 mins

Full refresh vs incremental in dbt - Trade-offs & Expert Analysis

Overview - Full refresh vs incremental
What is it?
Full refresh and incremental are two ways dbt updates the data behind a model. A full refresh rebuilds the entire dataset from scratch on every run. An incremental run adds or updates only the rows that are new or changed since the last run. Both approaches keep data fresh and accurate for analysis; they differ in how much work each run does.
Why it matters
Without these update methods, data would become outdated or require too much time and computing power to refresh. Full refresh ensures complete accuracy but can be slow for large data. Incremental saves time and resources by updating only what changed. Choosing the right method affects how fast and reliable your data is for decisions.
Where it fits
Learners should know basic SQL and how dbt models work before this. After understanding full refresh and incremental, learners can explore advanced dbt features like snapshots and incremental merge strategies.
Mental Model
Core Idea
Full refresh rebuilds everything every time, while incremental updates only add or change what is new or different.
Think of it like...
It's like cleaning a room: full refresh is cleaning the entire room from top to bottom every time, while incremental is just tidying up the parts that got messy since last time.
┌───────────────┐       ┌───────────────┐
│   Full Refresh│       │  Incremental  │
├───────────────┤       ├───────────────┤
│ Delete all old│       │ Keep old data │
│ Rebuild all   │──────▶│ Add new data  │
│ data from raw │       │ Update changed│
│ sources       │       │ data only     │
└───────────────┘       └───────────────┘
Build-Up - 7 Steps
1
Foundation: What is a full refresh?
🤔
Concept: Full refresh means rebuilding the entire dataset from scratch.
In dbt, a full refresh deletes the existing table and recreates it completely using the source data. This ensures the data is fully up to date but can take longer for large datasets.
Result
The database table contains only fresh data from the latest run, with no leftovers from before.
Understanding full refresh shows how dbt guarantees data accuracy by starting clean each time.
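As a sketch, a plain table materialization behaves the way a full refresh does: dbt rebuilds it completely from the source on every run. Model, source, and column names below are illustrative, not from a real project.

```sql
-- models/orders_full.sql (illustrative names)
-- A 'table' materialization is rebuilt from the source on every
-- dbt run, which is exactly the full-refresh behavior described above.
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    order_total,
    updated_at
from {{ source('shop', 'orders') }}
```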
2
Foundation: What is an incremental update?
🤔
Concept: Incremental update means adding or changing only new or modified data.
Instead of rebuilding the whole table, dbt incremental models add new rows or update changed rows based on a unique key or timestamp. This saves time and resources.
Result
The table grows or changes only where needed, keeping old data intact.
Knowing incremental updates helps you optimize data refresh speed and reduce computing costs.
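A minimal append-only incremental model might look like the sketch below. The `is_incremental()` macro guards the filter so the first run still builds the full table; names are illustrative.

```sql
-- models/orders_incremental.sql (illustrative names)
{{ config(materialized='incremental') }}

select
    order_id,
    customer_id,
    order_total,
    updated_at
from {{ source('shop', 'orders') }}

{% if is_incremental() %}
  -- On incremental runs, select only rows newer than what the
  -- target table ({{ this }}) already holds.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```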
3
Intermediate: How dbt implements full refresh
🤔 Before reading on: do you think dbt keeps old data during full refresh or deletes it? Commit to your answer.
Concept: dbt deletes the existing table before rebuilding it during a full refresh.
When you run dbt with the --full-refresh flag, it drops the target table and recreates it from the source data. This means all previous data is removed before loading fresh data.
Result
The table is fully replaced with new data, ensuring no stale data remains.
Understanding this prevents confusion about why full refresh can be slow and resource-heavy.
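On the command line, the flag looks like this (the model name is illustrative):

```shell
# Drop and rebuild a single incremental model from its sources
dbt run --full-refresh --select orders_incremental

# Full-refresh every model in the project (can be expensive)
dbt run --full-refresh
```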
4
Intermediate: How dbt implements incremental models
🤔 Before reading on: do you think incremental models always append data or can they update existing rows? Commit to your answer.
Concept: dbt incremental models can both append new rows and update existing rows based on logic you define.
In dbt, you write SQL that selects only new or changed data. dbt merges this data into the existing table using unique keys. This can be simple append or more complex update logic.
Result
The table is updated efficiently without rebuilding everything.
Knowing this helps you write incremental models that keep data consistent and fast to update.
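With a `unique_key` configured, dbt matches incoming rows to existing rows and updates them instead of appending. A sketch with illustrative names:

```sql
-- models/orders_incremental.sql (illustrative names)
{{ config(
    materialized='incremental',
    unique_key='order_id'   -- rows with a matching order_id get updated
) }}

select
    order_id,
    customer_id,
    order_total,
    updated_at
from {{ source('shop', 'orders') }}

{% if is_incremental() %}
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```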
5
Intermediate: Choosing between full refresh and incremental
🤔 Before reading on: do you think incremental is always better than full refresh? Commit to your answer.
Concept: Choosing depends on data size, update frequency, and accuracy needs.
Full refresh is simple and safe but slow for big data. Incremental is faster but requires careful logic to avoid errors. Sometimes full refresh is needed to fix data issues or after schema changes.
Result
You pick the best method for your project needs balancing speed and correctness.
Understanding tradeoffs helps you design efficient and reliable data pipelines.
6
Advanced: Handling schema changes in incremental models
🤔 Before reading on: do you think incremental models automatically handle column changes? Commit to your answer.
Concept: Incremental models do not automatically handle schema changes; you must manage them carefully.
If you add or remove columns, incremental models may fail or produce wrong data. You often need to run a full refresh or write migration logic to handle schema updates safely.
Result
Proper schema management avoids broken pipelines and data errors.
Knowing this prevents common production bugs when evolving data models.
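dbt does offer an `on_schema_change` config for incremental models, but it only adjusts the column set; it does not backfill values for rows that already exist, so a full refresh is often still the safer fix. A sketch:

```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    -- options: 'ignore' (default), 'fail',
    -- 'append_new_columns', 'sync_all_columns'
    on_schema_change='append_new_columns'
) }}
```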
7
Expert: Advanced incremental merge strategies
🤔 Before reading on: do you think incremental merges always use simple append? Commit to your answer.
Concept: Advanced incremental merges use database-specific features like MERGE statements for upserts.
Some databases support MERGE or UPSERT commands that combine insert and update in one step. dbt can leverage these for efficient incremental updates, reducing complexity and improving performance.
Result
Incremental models become more robust and faster with advanced merge logic.
Understanding database capabilities unlocks powerful incremental update patterns beyond simple append.
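On warehouses that support it, you can request a MERGE-based strategy explicitly; the strategies available vary by adapter (for example 'append', 'merge', 'delete+insert', 'insert_overwrite'). A minimal config sketch:

```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
) }}
```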
Under the Hood
Full refresh works by dropping the entire target table and recreating it from source data, ensuring a clean slate. Incremental updates run a query that selects only new or changed rows and merges them into the existing table using keys. This merge can be an insert or update depending on the logic. Internally, dbt compiles SQL to perform these operations efficiently on the database engine.
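Roughly, a merge-based incremental run compiles to a statement shaped like the sketch below. The exact SQL and the temp-table naming vary by adapter, and the table and column names here are illustrative.

```sql
merge into analytics.orders as target
using orders__dbt_tmp as source          -- staged new/changed rows
on target.order_id = source.order_id
when matched then update set
    customer_id = source.customer_id,
    order_total = source.order_total,
    updated_at  = source.updated_at
when not matched then insert
    (order_id, customer_id, order_total, updated_at)
    values (source.order_id, source.customer_id,
            source.order_total, source.updated_at);
```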
Why designed this way?
Full refresh was designed for simplicity and correctness, guaranteeing fresh data but at a cost. Incremental was introduced to optimize performance for large datasets by avoiding full rebuilds. The design balances ease of use with efficiency, letting users choose based on their needs. Alternatives like change data capture exist but are more complex to implement.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Source Data   │──────▶│ Full Refresh  │──────▶│ Drop & Rebuild│
│ (Raw tables)  │       │ Process       │       │ Entire Table  │
└───────────────┘       └───────────────┘       └───────────────┘

┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Source Data   │──────▶│ Incremental   │──────▶│ Merge New &   │
│ (Raw tables)  │       │ Process       │       │ Changed Rows  │
└───────────────┘       └───────────────┘       └───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does incremental update always guarantee fully accurate data? Commit yes or no.
Common Belief:Incremental updates always keep data perfectly accurate without issues.
Reality:Incremental updates can miss changes or cause duplicates if keys or logic are incorrect.
Why it matters:Relying blindly on incremental can lead to wrong analysis and bad decisions.
Quick: Does full refresh always take longer than incremental? Commit yes or no.
Common Belief:Full refresh is always slower than incremental updates.
Reality:For small datasets or after many incremental runs, full refresh can be faster and simpler.
Why it matters:Choosing incremental blindly can waste time if full refresh is actually more efficient.
Quick: Do incremental models automatically handle schema changes? Commit yes or no.
Common Belief:Incremental models adapt automatically to schema changes like new columns.
Reality:Schema changes often break incremental models unless handled explicitly.
Why it matters:Ignoring this causes pipeline failures and data loss.
Quick: Is full refresh always necessary after any data change? Commit yes or no.
Common Belief:You must run full refresh every time data changes to be safe.
Reality:Incremental updates can handle many changes efficiently without full refresh.
Why it matters:Overusing full refresh wastes resources and slows down workflows.
Expert Zone
1
Incremental models require careful choice of unique keys and update logic to avoid data duplication or loss.
2
Some databases support advanced merge commands that can simplify incremental updates but require custom SQL.
3
Full refresh can be combined with incremental by scheduling periodic full refreshes to reset data state.
When NOT to use
Avoid incremental models when data sources lack reliable unique keys or when schema changes frequently. Use full refresh or snapshots instead. Incremental is also not ideal for very small datasets where full refresh is fast enough.
Production Patterns
In production, teams often use incremental models for daily updates and schedule full refreshes weekly or monthly. They also implement tests to detect incremental failures and use database-specific merge features for performance.
Connections
Change Data Capture (CDC)
Builds-on
Incremental updates in dbt are a simplified form of CDC, capturing only changes to update data efficiently.
ETL Pipelines
Same pattern
Full refresh and incremental updates are core patterns in ETL pipelines to manage data freshness and resource use.
Software Version Control
Opposite pattern
Full refresh is like resetting to a clean commit, while incremental updates are like applying patches or commits incrementally.
Common Pitfalls
#1 Running an incremental model without a unique key causes duplicate rows.
Wrong approach:{{ config(materialized='incremental') }} with no unique_key, filtering on updated_at > (SELECT MAX(updated_at) FROM {{ this }}); any row selected again on a later run is appended again.
Correct approach:{{ config(materialized='incremental', unique_key='id') }} so dbt matches incoming rows to existing ones and updates them instead of appending duplicates.
Root cause:Without a unique key, dbt can only append; it has no way to merge or update rows, so any row processed twice lands in the table twice.
#2Running incremental model after schema change without full refresh causes errors.
Wrong approach:Run incremental model after adding a new column without adjusting model or refreshing.
Correct approach:Run dbt with --full-refresh after the schema change, or configure on_schema_change (e.g. 'append_new_columns') so the model can pick up new columns.
Root cause:Incremental logic does not handle schema changes automatically, requiring manual refresh.
#3Using full refresh on very large datasets daily causes slow pipelines.
Wrong approach:dbt run --full-refresh every day on multi-million row tables.
Correct approach:Use incremental models for daily updates and schedule full refresh less frequently.
Root cause:Not balancing resource use and data freshness leads to inefficient workflows.
Key Takeaways
Full refresh rebuilds the entire dataset from scratch, ensuring complete accuracy but can be slow.
Incremental updates add or change only new or modified data, saving time and resources.
Choosing between full refresh and incremental depends on data size, update frequency, and schema stability.
Incremental models require careful design of unique keys and update logic to avoid errors.
Advanced incremental merges use database features like MERGE for efficient upserts in production.