dbt · data · ~10 mins

Handling late-arriving data in dbt - Step-by-Step Execution

Concept Flow - Handling late-arriving data
Data arrives
Is the data late?
No → Process normally
Yes → Apply late-data handling logic
Update existing records or append
Recalculate aggregates if needed
Final clean dataset ready
Data arrives and is checked for lateness. If it is late, special logic updates or appends the affected records, and aggregates are then recalculated.
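In a real dbt project, the "apply late-data handling logic" branch is often implemented as an incremental model with a lookback window, so late rows overwrite their earlier versions instead of being appended blindly. A minimal sketch; the model path, source, and column names here are assumptions, not part of the example below:

```sql
-- models/events.sql: hypothetical incremental model with a 7-day lookback
{{ config(materialized='incremental', unique_key='id') }}

select id, date, payload
from {{ source('raw', 'events') }}
{% if is_incremental() %}
  -- reprocess the trailing window so late-arriving rows replace
  -- the earlier versions matched on unique_key
  where date >= (select max(date) from {{ this }}) - interval '7' day
{% endif %}
```

With `unique_key` set, dbt merges the reprocessed rows into the existing table, which avoids the duplicate problem that a plain union creates.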
Execution Sample
dbt
with base as (
  -- all rows loaded from the source
  select * from source_table
),
late_data as (
  -- rows older than the 7-day threshold are treated as late
  select * from base where date < current_date - interval '7' day
),
final as (
  -- union all may duplicate the late rows; they are cleaned downstream
  select * from base
  union all
  select * from late_data
)
select * from final
This dbt SQL snippet isolates rows older than 7 days as late data and unions them back onto the base data; the duplicates this creates are handled in the final processing step.
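Because the union can duplicate late rows, a deduplication pass is usually added before the data is consumed. One common sketch, keeping a single row per id with a window function over the `final` CTE above:

```sql
-- deduplicate: keep one row per id, preferring the most recent date
select id, date
from (
  select id, date,
         row_number() over (partition by id order by date desc) as rn
  from final
)
where rn = 1
```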
Execution Table
| Step | Action | Data Sample | Resulting Data State |
| --- | --- | --- | --- |
| 1 | Load source_table | [{id:1, date:2024-06-01}, {id:2, date:2024-06-10}] | All data loaded |
| 2 | Filter late data (date < current_date - 7) | [{id:1, date:2024-06-01}] | Late data isolated |
| 3 | Combine base and late_data | [{id:1, date:2024-06-01}, {id:2, date:2024-06-10}, {id:1, date:2024-06-01}] | Duplicates possible, data unioned |
| 4 | Process final dataset | Duplicates handled or aggregates recalculated | Clean final dataset ready |
| 5 | End | - | Processing complete |
💡 All data, including late arrivals, is processed; the final dataset is ready for analysis
Variable Tracker
| Variable | Start | After Step 2 | After Step 3 | Final |
| --- | --- | --- | --- | --- |
| base | empty | [{id:1, date:2024-06-01}, {id:2, date:2024-06-10}] | [{id:1, date:2024-06-01}, {id:2, date:2024-06-10}] | [{id:1, date:2024-06-01}, {id:2, date:2024-06-10}] |
| late_data | empty | [{id:1, date:2024-06-01}] | [{id:1, date:2024-06-01}] | [{id:1, date:2024-06-01}] |
| final | empty | empty | [{id:1, date:2024-06-01}, {id:2, date:2024-06-10}, {id:1, date:2024-06-01}] | [cleaned dataset without duplicates] |
Key Moments - 2 Insights
Why do we check if data is late before processing?
Because late data can change past results, we handle it separately so that existing records can be updated or appended, as shown in step 2 of the execution table.
What happens if we just append late data without cleaning duplicates?
Duplicates can skew analysis results. Step 3 unions the data, which may create duplicates, so step 4 deduplicates or recalculates aggregates.
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what data does 'late_data' contain after step 2?
A. All records from source_table
B. Records with a date within the last 7 days
C. Records with a date older than 7 days
D. Empty dataset
💡 Hint
Check the 'Data Sample' column in step 2 of the execution table
At which step does the dataset potentially contain duplicates?
A. Step 2
B. Step 3
C. Step 1
D. Step 4
💡 Hint
Look at the 'Resulting Data State' column in step 3, which mentions duplicates
If we skip late data handling, what is the likely impact on the final dataset?
A. The final dataset will be missing late-arriving records
B. The final dataset will have duplicates
C. The final dataset will be empty
D. No impact; data is always complete
💡 Hint
Refer to the concept flow, where late data is checked and handled specially
Concept Snapshot
Handling late-arriving data in dbt:
- Identify late data by comparing event dates
- Separate late data for special processing
- Combine base and late data carefully
- Clean duplicates or recalculate aggregates
- Ensures accurate, up-to-date datasets
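The "recalculate aggregates" step above can itself be a dbt incremental model that rebuilds only the recent days late rows can still touch, rather than the whole table. A sketch under assumed names (the `daily_counts` model, upstream `events` model, and `event_count` column are illustrative):

```sql
-- models/daily_counts.sql: hypothetical incremental aggregate
{{ config(materialized='incremental', unique_key='date') }}

select date, count(*) as event_count
from {{ ref('events') }}
{% if is_incremental() %}
  -- recompute the trailing 7 days so counts absorb late arrivals
  where date >= current_date - interval '7' day
{% endif %}
group by date
```

The lookback width should match the lateness threshold used upstream, so no late row can land outside the recomputed window.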
Full Transcript
This visual execution shows how to handle late-arriving data in dbt. First, data is loaded from the source. Then, late data is identified by filtering records older than a threshold (7 days). This late data is combined with the base data, which may cause duplicates. Finally, duplicates are cleaned or aggregates recalculated to produce a clean final dataset. Handling late data ensures that delayed records update past results correctly, avoiding errors in analysis.