
How to Handle Late Arriving Data in dbt: Fix and Best Practices

In dbt, handle late arriving data with an incremental model that sets a unique_key and uses the merge strategy. When older data arrives after the initial load, dbt then updates the matching rows instead of skipping them or inserting duplicates.

Why This Happens

Late arriving data is data for past dates or events that lands after the initial load has already run. An incremental model that only appends rows newer than the latest loaded timestamp handles it badly in two ways: rows with an older timestamp are filtered out and never loaded, and rows that were already loaded are inserted again as duplicates when they change.

This usually occurs because the model uses the append strategy with no unique_key, so dbt has no way to match incoming rows to existing ones and update them.

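sql
-- model.sql
{{ config(materialized='incremental', incremental_strategy='append') }}

select * from source_table
{% if is_incremental() %}
-- Only rows newer than the latest loaded timestamp are selected, then appended as-is
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}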
Output
Duplicates or missing updates for late arriving data in the final table.

The Fix

Change your incremental model to use incremental_strategy: merge and specify a unique_key. This tells dbt to update existing rows when late data arrives instead of just appending.

Also make sure the where clause selects every row that may have changed, including rows whose updated_at is older than the latest value already loaded.

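sql
-- model.sql
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='id') }}

select * from source_table
{% if is_incremental() %}
-- coalesce covers an existing but empty target table
where updated_at > (select coalesce(max(updated_at), '1900-01-01') from {{ this }})
{% endif %}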
Output
Rows that match on id are updated in place and new rows are inserted, so reloaded records no longer create duplicates.
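
A filter that compares against max(updated_at) can still skip a row that lands late with an older updated_at. One common way to cover that case is to reprocess a trailing window on every run and let the merge on id absorb the overlap. The sketch below assumes a 3-day window (an illustrative value, not something the source data dictates), a hypothetical model name, and the dbt.dateadd cross-database macro available in dbt 1.2 and later. It omits the empty-table coalesce for brevity.

```sql
-- models/late_arriving_events.sql (hypothetical model name)
{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id'
  )
}}

select *
from source_table
{% if is_incremental() %}
-- Re-select a trailing 3-day window (assumed value) so rows that arrive late
-- with an older updated_at are still picked up. Rows already loaded are
-- matched on id and updated, so the overlap does not create duplicates.
where updated_at > (
    select {{ dbt.dateadd("day", -3, "max(updated_at)") }}
    from {{ this }}
)
{% endif %}
```

The window size is a trade-off: a wider window catches data that arrives later but rescans more rows on each run, so it is worth sizing it to the longest delay you actually observe.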

Prevention

To avoid issues with late arriving data in the future:

  • Use incremental_strategy: merge with a stable unique_key for any incremental model whose rows can change or arrive after the first load.
  • Design your where clause to capture updates and late data, not just new rows.
  • Test your incremental logic with sample late data to confirm updates happen correctly.
  • Document your data freshness and update policies clearly for your team.
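
The testing step above can be sketched as a dbt singular test, which fails whenever its query returns rows. This example checks that a run which reloaded late data left no duplicate keys behind. The file name and the model name late_arriving_events are hypothetical placeholders for your own.

```sql
-- tests/assert_no_duplicate_ids.sql (hypothetical file name)
-- A dbt singular test passes only when this query returns zero rows.
select
    id,
    count(*) as row_count
from {{ ref('late_arriving_events') }}  -- hypothetical model name
group by id
having count(*) > 1
```

To exercise it, load a few rows, run the model, then add a late row that reuses an existing id and run the model and the test again. The built-in unique generic test on id expresses the same check in YAML if you prefer to declare it there.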

Related Errors

Common related errors include:

  • Duplicate rows: the incremental model has no unique_key, so changed rows are inserted again.
  • Missing updates: incremental_strategy is set to append, so existing rows are never updated.
  • Stale or incomplete data: the where clause filters out late arriving rows whose timestamps are older than the latest value already loaded.
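
To see whether a model has skipped rows or updates, one option is to compare it against its source. The query below is a rough sketch that could be saved as a dbt analysis. It assumes id uniquely identifies rows in source_table and that the model is named late_arriving_events, which is a hypothetical name.

```sql
-- analyses/find_missed_rows.sql (hypothetical file name)
-- Lists source rows that are absent from the model or newer than its copy.
select
    s.id,
    s.updated_at as source_updated_at,
    t.updated_at as model_updated_at
from source_table as s
left join {{ ref('late_arriving_events') }} as t  -- hypothetical model name
    on s.id = t.id
where t.id is null
   or s.updated_at > t.updated_at
```

An empty result right after a run suggests the incremental filter is keeping up. Rows that keep appearing point to a filter that excludes late arriving data.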

Key Takeaways

  • Use incremental models with merge strategy and unique_key to handle late arriving data.
  • Ensure your incremental filter captures updates and late data, not just new rows.
  • Test incremental logic with late data scenarios to avoid duplicates or missing updates.
  • Document your data update policies to keep your team aligned.
  • Avoid append-only incremental models when late arriving data is expected.