dbt · data · ~10 mins

Handling late-arriving data in dbt - Step-by-Step Execution

Concept Flow - Handling late-arriving data
Data arrives
Is the data late?
No → Process normally
Yes → Apply late-data handling logic
Update existing records or append
Recalculate aggregates if needed
Final clean dataset ready
Data arrives and is checked for lateness. If it is late, special logic updates or appends the affected records, and aggregates are then recalculated.
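In a real dbt project, the "apply late-data handling logic" branch is often implemented as an incremental model with a lookback window, so late rows overwrite their earlier versions instead of being appended blindly. A minimal sketch; the model path, source, and column names here are assumptions, not part of the example below:

```sql
-- models/events.sql: hypothetical incremental model with a 7-day lookback
{{ config(materialized='incremental', unique_key='id') }}

select id, date, payload
from {{ source('raw', 'events') }}
{% if is_incremental() %}
  -- reprocess the trailing window so late-arriving rows replace
  -- the earlier versions matched on unique_key
  where date >= (select max(date) from {{ this }}) - interval '7' day
{% endif %}
```

With `unique_key` set, dbt merges the reprocessed rows into the existing table, which avoids the duplicate problem that a plain union creates.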
Execution Sample
dbt
with base as (
  -- all rows loaded from the source
  select * from source_table
),
late_data as (
  -- rows older than the 7-day threshold are treated as late
  select * from base where date < current_date - interval '7' day
),
final as (
  -- union all may duplicate the late rows; they are cleaned downstream
  select * from base
  union all
  select * from late_data
)
select * from final
This dbt SQL snippet isolates rows older than 7 days as late data and unions them back onto the base data; the duplicates this creates are handled in the final processing step.
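Because the union can duplicate late rows, a deduplication pass is usually added before the data is consumed. One common sketch, keeping a single row per id with a window function over the `final` CTE above:

```sql
-- deduplicate: keep one row per id, preferring the most recent date
select id, date
from (
  select id, date,
         row_number() over (partition by id order by date desc) as rn
  from final
)
where rn = 1
```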
Execution Table
| Step | Action | Data Sample | Resulting Data State |
| --- | --- | --- | --- |
| 1 | Load source_table | [{id:1, date:2024-06-01}, {id:2, date:2024-06-10}] | All data loaded |
| 2 | Filter late data (date < current_date - 7) | [{id:1, date:2024-06-01}] | Late data isolated |
| 3 | Combine base and late_data | [{id:1, date:2024-06-01}, {id:2, date:2024-06-10}, {id:1, date:2024-06-01}] | Duplicates possible, data unioned |
| 4 | Process final dataset | Duplicates handled or aggregates recalculated | Clean final dataset ready |
| 5 | End | - | Processing complete |
💡 All data, including late arrivals, is processed; the final dataset is ready for analysis
Variable Tracker
| Variable | Start | After Step 2 | After Step 3 | Final |
| --- | --- | --- | --- | --- |
| base | empty | [{id:1, date:2024-06-01}, {id:2, date:2024-06-10}] | [{id:1, date:2024-06-01}, {id:2, date:2024-06-10}] | [{id:1, date:2024-06-01}, {id:2, date:2024-06-10}] |
| late_data | empty | [{id:1, date:2024-06-01}] | [{id:1, date:2024-06-01}] | [{id:1, date:2024-06-01}] |
| final | empty | empty | [{id:1, date:2024-06-01}, {id:2, date:2024-06-10}, {id:1, date:2024-06-01}] | [cleaned dataset without duplicates] |
Key Moments - 2 Insights
Why do we check if data is late before processing?
Because late data can change past results, we handle it separately so that existing records can be updated or appended, as shown in step 2 of the execution table.
What happens if we just append late data without cleaning duplicates?
Duplicates can skew analysis results. Step 3 unions the data, which may create duplicates, so step 4 deduplicates or recalculates aggregates.
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what data does 'late_data' contain after step 2?
A. All records from source_table
B. Records with a date within the last 7 days
C. Records with a date older than 7 days
D. Empty dataset
💡 Hint
Check the 'Data Sample' column in step 2 of the execution table
At which step does the dataset potentially contain duplicates?
A. Step 2
B. Step 3
C. Step 1
D. Step 4
💡 Hint
Look at the 'Resulting Data State' column in step 3, which mentions duplicates
If we skip late data handling, what is the likely impact on the final dataset?
A. The final dataset will be missing late-arriving records
B. The final dataset will have duplicates
C. The final dataset will be empty
D. No impact; data is always complete
💡 Hint
Refer to the concept flow, where late data is checked and handled specially
Concept Snapshot
Handling late-arriving data in dbt:
- Identify late data by comparing event dates
- Separate late data for special processing
- Combine base and late data carefully
- Clean duplicates or recalculate aggregates
- Ensures accurate, up-to-date datasets
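The "recalculate aggregates" step above can itself be a dbt incremental model that rebuilds only the recent days late rows can still touch, rather than the whole table. A sketch under assumed names (the `daily_counts` model, upstream `events` model, and `event_count` column are illustrative):

```sql
-- models/daily_counts.sql: hypothetical incremental aggregate
{{ config(materialized='incremental', unique_key='date') }}

select date, count(*) as event_count
from {{ ref('events') }}
{% if is_incremental() %}
  -- recompute the trailing 7 days so counts absorb late arrivals
  where date >= current_date - interval '7' day
{% endif %}
group by date
```

The lookback width should match the lateness threshold used upstream, so no late row can land outside the recomputed window.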
Full Transcript
This visual execution shows how to handle late-arriving data in dbt. First, data is loaded from the source. Then, late data is identified by filtering records older than a threshold (7 days). This late data is combined with the base data, which may cause duplicates. Finally, duplicates are cleaned or aggregates recalculated to produce a clean final dataset. Handling late data ensures that delayed records update past results correctly, avoiding errors in analysis.