How to Handle Late Arriving Data in dbt: Fix and Best Practices
In dbt, late-arriving data can be handled with incremental models that use a unique_key and the merge strategy to update existing records. When older data arrives late, the model then updates the matching rows instead of ignoring or duplicating them.
Why This Happens
Late arriving data happens when data for past dates or events arrives after the initial data load. In dbt, if your incremental model only appends new data without updating existing records, late data will be missed or cause duplicates.
This usually occurs because the incremental model is set to append only, without a proper unique_key or merge logic to update existing rows.
yaml
incremental_strategy: append

-- model.sql
select * from source_table
where updated_at > (select max(updated_at) from {{ this }})
Output
Duplicates or missing updates for late arriving data in the final table.
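One way to see the effect is to look for source rows that the append-only filter would skip. The query below is a minimal sketch, not a definitive check. It assumes the incremental model is called my_incremental_model (a hypothetical name) and that both tables carry the id and updated_at columns used in the example. It could live in a dbt analyses/ file.

```sql
-- analyses/find_missed_late_rows.sql (hypothetical file name)
-- Lists source rows at or below the model's high-water mark that are
-- either absent from the model or newer than the copy the model holds.
with high_water_mark as (
    select max(updated_at) as max_updated_at
    from {{ ref('my_incremental_model') }}
)

select
    s.id,
    s.updated_at as source_updated_at,
    t.updated_at as model_updated_at
from source_table as s
cross join high_water_mark as h
left join {{ ref('my_incremental_model') }} as t
    on t.id = s.id
where s.updated_at <= h.max_updated_at
  and (t.id is null or t.updated_at < s.updated_at)
```

Any rows returned are ones that a filter of the form updated_at > max(updated_at) will not pick up on the next run.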
The Fix
Change your incremental model to use incremental_strategy: merge and specify a unique_key. This tells dbt to update existing rows when late data arrives instead of just appending.
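Also, make sure the where clause filters on a column that changes whenever a row is loaded or modified (such as updated_at) rather than on the event date, so that late rows for past dates still pass the filter.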
yaml
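incremental_strategy: merge
unique_key: id

-- model.sql
select * from source_table
{% if is_incremental() %}
-- on incremental runs, only pick up rows changed since the last load
where updated_at > (select coalesce(max(updated_at), '1900-01-01') from {{ this }})
{% endif %}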
Output
Existing rows updated and late arriving data correctly merged without duplicates.
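If late rows can carry an updated_at that is older than the latest value already loaded, a strict greater-than filter will still skip them. A common approach is to reprocess a lookback window on every run and let merge deduplicate on unique_key. The sketch below assumes a three-day window (an arbitrary value chosen for illustration), the id and updated_at columns from the example, a hypothetical model file name, and a dbt version that ships the cross-database dbt.dateadd macro.

```sql
-- models/my_incremental_model.sql (hypothetical file name)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id'
) }}

{# assumed window; size it to how late your data can realistically arrive #}
{% set lookback_days = 3 %}

select *
from source_table
{% if is_incremental() %}
-- reprocess everything newer than (latest loaded updated_at minus the window);
-- merging on unique_key keeps the reprocessed rows from duplicating
where updated_at > (
    select {{ dbt.dateadd(datepart="day", interval=-lookback_days, from_date_or_timestamp="max(updated_at)") }}
    from {{ this }}
)
{% endif %}
```

A wider window catches later arrivals at the cost of scanning and merging more rows per run, so the value is a trade-off rather than a fixed recommendation.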
Prevention
To avoid issues with late arriving data in the future:
- Always use incremental_strategy: merge with a proper unique_key for incremental models.
- Design your where clause to capture updates and late data, not just new rows.
- Test your incremental logic with sample late data to confirm updates happen correctly.
- Document your data freshness and update policies clearly for your team.
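The testing point above can be partly automated with a singular dbt test, which fails when its query returns any rows. This is a minimal sketch that assumes the model is named my_incremental_model (hypothetical) and that id is the unique_key.

```sql
-- tests/assert_no_duplicate_ids.sql (hypothetical file name)
-- Returns one row per id that appears more than once in the model;
-- dbt treats any returned row as a test failure.
select
    id,
    count(*) as row_count
from {{ ref('my_incremental_model') }}
group by id
having count(*) > 1
```

Running the model twice against a sample that includes a late row, then running dbt test, gives a quick signal on whether the merge is deduplicating as intended.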
Related Errors
Common related errors include:
- Duplicate rows: caused by a missing unique_key in incremental models.
- Missing updates: when incremental_strategy is set to append only.
- Incorrect data freshness: if the where clause does not cover late arriving data timestamps.
Key Takeaways
Use incremental models with merge strategy and unique_key to handle late arriving data.
Ensure your incremental filter captures updates and late data, not just new rows.
Test incremental logic with late data scenarios to avoid duplicates or missing updates.
Document your data update policies to keep your team aligned.
Avoid append-only incremental models when late arriving data is expected.