Late-arriving data can cause issues in analytics and reporting. Which of the following best explains why handling late-arriving data is important?
Think about how late data affects historical analysis and accuracy.
Handling late-arriving data is important because it allows the pipeline to update historical records accurately, ensuring reports reflect all relevant data even if it arrives after the initial processing.
Consider a dbt incremental model that uses is_incremental() to update records. What will be the output after running this model twice if late-arriving data for an existing date is included in the second run?
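with source_data as (
    select * from {{ ref('raw_events') }}
),
updates as (
    select * from source_data
    {% if is_incremental() %}
    -- on incremental runs, only pick up rows at or after the latest loaded date
    where event_date >= (select max(event_date) from {{ this }})
    {% endif %}
)
select * from updates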
Think about how incremental models handle data for existing keys.
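On the second run, the filter selects only rows whose event_date is at or after the latest date already in the table. If the model defines a unique_key, those rows update matching records and new keys are appended; without one, the rows are simply appended and can duplicate existing records. Late-arriving rows dated before the current maximum are not picked up by this filter.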
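The snippet does not show its materialization settings, so the outcome depends on whether a unique_key is set. As a rough illustration only, the Python sketch below simulates the `>= max(event_date)` filter under the assumption that no unique_key is configured, so selected rows are appended. The `incremental_run` helper and the sample rows are hypothetical and are not part of dbt.

```python
from datetime import date


def incremental_run(target, source):
    """Append source rows whose event_date >= max(event_date) in target."""
    if not target:
        # First run: nothing loaded yet, so everything is selected.
        return list(source)
    cutoff = max(row["event_date"] for row in target)
    new_rows = [row for row in source if row["event_date"] >= cutoff]
    # Assumption: append-only behaviour (no unique_key), so no deduplication.
    return target + new_rows


source = [
    {"id": 1, "event_date": date(2024, 1, 1)},
    {"id": 2, "event_date": date(2024, 1, 2)},
]
target = incremental_run([], source)

# Before the second run, two late rows arrive: one for Jan 1, one for Jan 2.
source += [
    {"id": 3, "event_date": date(2024, 1, 1)},
    {"id": 4, "event_date": date(2024, 1, 2)},
]
target = incremental_run(target, source)

ids = sorted(row["id"] for row in target)
print(ids)  # [1, 2, 2, 4]
```

Under that assumption, id 2 is loaded twice because it sits on the cutoff date, and id 3 is never loaded because its date is earlier than the cutoff.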
You have a table with 1000 rows for dates Jan 1-10. On Jan 11, 50 late-arriving rows for Jan 5 arrive. After running a dbt incremental model that merges late data correctly, how many rows will the table have?
Consider how merging late-arriving data affects existing rows.
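When late-arriving data is merged correctly on a unique key, rows that match existing keys replace or update those rows, so the total row count stays at 1000. This assumes the 50 late rows correspond to keys already in the table; late rows with new keys would be inserted and raise the count.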
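The row-count arithmetic can be checked with a small simulation. This is a minimal sketch that models the table as a dict keyed by a hypothetical `id` column, which stands in for a merge on a unique key; the `merge` helper and the data layout (100 rows per day) are assumptions for illustration.

```python
def merge(target, incoming):
    """Upsert: matching keys are overwritten, new keys are inserted."""
    merged = dict(target)
    merged.update(incoming)
    return merged


# 1000 rows spread evenly over Jan 1-10 (100 rows per day).
target = {
    i: {"event_date": f"2024-01-{i % 10 + 1:02d}", "value": 0}
    for i in range(1000)
}

# Case 1: the 50 late rows reuse keys of existing Jan 5 rows.
late_existing = {
    i: {"event_date": "2024-01-05", "value": 1} for i in range(4, 500, 10)
}

# Case 2: the 50 late rows carry keys the table has never seen.
late_new = {
    1000 + i: {"event_date": "2024-01-05", "value": 1} for i in range(50)
}

print(len(merge(target, late_existing)))  # 1000
print(len(merge(target, late_new)))       # 1050
```

The count stays at 1000 only in the first case, where every late row matches an existing key.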
Review the following dbt model code snippet meant to handle late-arriving data. What error will occur when running it?
{{ config(materialized='incremental') }}
select * from {{ ref('raw_events') }}
where event_date > (select max(event_date) from {{ this }})
Think about the filter condition and how it handles data equal to the max date.
The filter uses > max(event_date), so rows with event_date equal to max(event_date) are excluded, causing late-arriving data for that date to be ignored.
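The difference between the strict and the inclusive comparison is easy to see on a handful of dates. The sketch below is only an illustration: `select_new` is a hypothetical helper, and ISO-formatted date strings are used because they compare correctly as plain strings.

```python
import operator


def select_new(source_dates, max_loaded, op):
    """Return the source dates that pass the comparison against max_loaded."""
    return [d for d in source_dates if op(d, max_loaded)]


max_loaded = "2024-01-10"
source_dates = ["2024-01-09", "2024-01-10", "2024-01-10", "2024-01-11"]

strict = select_new(source_dates, max_loaded, operator.gt)
inclusive = select_new(source_dates, max_loaded, operator.ge)

print(strict)     # ['2024-01-11']
print(inclusive)  # ['2024-01-10', '2024-01-10', '2024-01-11']
```

With `>`, both rows dated on the current maximum are dropped; with `>=`, they are selected again, which is only safe if the load merges on a key rather than appending.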
You want to ensure your dbt incremental model correctly updates records when late-arriving data comes in for any date, including past dates. Which approach below is best?
Think about how to include late-arriving data for any date already in the table without full refresh.
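Filtering source data with event_date >= min(event_date) from the existing table brings every date already present back into scope, so the incremental model can merge late-arriving data for any of those dates without a full refresh. This relies on a unique_key so that reprocessed rows update existing records instead of duplicating them.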
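A rough end-to-end illustration of this approach, using Python's built-in sqlite3 module rather than dbt: `INSERT OR REPLACE` on a primary key stands in for a merge on unique_key, and the `target` table, the `id` and `amount` columns, and the sample values are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (id INTEGER PRIMARY KEY, event_date TEXT, amount INTEGER);
    CREATE TABLE target     (id INTEGER PRIMARY KEY, event_date TEXT, amount INTEGER);
    INSERT INTO raw_events VALUES
        (1, '2024-01-01', 10), (2, '2024-01-05', 20), (3, '2024-01-10', 30);
    -- Initial load: copy everything into the target table.
    INSERT INTO target SELECT id, event_date, amount FROM raw_events;
""")

# Late data: a correction to id 2 (Jan 5) and a brand-new row for Jan 3.
conn.executescript("""
    UPDATE raw_events SET amount = 25 WHERE id = 2;
    INSERT INTO raw_events VALUES (4, '2024-01-03', 40);
""")

# Incremental run: reprocess everything from the earliest loaded date onward.
min_date = conn.execute("SELECT MIN(event_date) FROM target").fetchone()[0]
conn.execute(
    """
    INSERT OR REPLACE INTO target
    SELECT id, event_date, amount FROM raw_events WHERE event_date >= ?
    """,
    (min_date,),
)

rows = conn.execute(
    "SELECT id, event_date, amount FROM target ORDER BY id"
).fetchall()
print(rows)
# [(1, '2024-01-01', 10), (2, '2024-01-05', 25), (3, '2024-01-10', 30), (4, '2024-01-03', 40)]
```

The corrected row is updated in place and the new historical row is inserted, with no duplicates. Note that because the filter starts at the earliest loaded date, each run re-reads all source rows from that date forward.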