Handling late-arriving data in dbt - Time & Space Complexity
When working with data pipelines, late-arriving data can affect how long processing takes.
We want to know how the time to handle that data grows as more late rows arrive.
Analyze the time complexity of the following dbt code snippet.
```sql
-- Model to merge late-arriving data
with new_data as (
    select *
    from source_table
    where event_date >= dateadd(day, -7, current_date)
),
merged as (
    select * from target_table
    union all
    select * from new_data
)
select *
from merged
where event_date <= current_date
```
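To make the model's logic concrete, here is a self-contained sketch that runs the same query against Python's built-in `sqlite3` (standing in for the warehouse). The table names and columns mirror the snippet above; the fixed date and the sample row counts are assumptions made so the example is deterministic, not part of the original model.

```python
import sqlite3
from datetime import date, timedelta

# sqlite3 stands in for the warehouse; this is a sketch of the model's
# logic, not the actual dbt execution path.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table source_table (event_date text, value integer)")
cur.execute("create table target_table (event_date text, value integer)")

today = date(2024, 1, 15)  # fixed "current_date" for a deterministic example

# Existing rows in the target, well outside the 7-day window
cur.executemany(
    "insert into target_table values (?, ?)",
    [((today - timedelta(days=30 + i)).isoformat(), i) for i in range(5)],
)
# Late-arriving rows inside the last 7 days
cur.executemany(
    "insert into source_table values (?, ?)",
    [((today - timedelta(days=i)).isoformat(), i) for i in range(1, 4)],
)

cutoff = (today - timedelta(days=7)).isoformat()
rows = cur.execute(
    """
    with new_data as (
        select * from source_table where event_date >= ?
    ),
    merged as (
        select * from target_table
        union all
        select * from new_data
    )
    select * from merged where event_date <= ?
    """,
    (cutoff, today.isoformat()),
).fetchall()

print(len(rows))  # 5 existing + 3 late = 8 rows
```

ISO-formatted date strings compare correctly as text, which is why the `>=`/`<=` filters work in SQLite without a date type.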
This code merges recent late-arriving data with existing data for processing.
Identify the repeated work. SQL has no explicit loops or recursion, but each table scan is an implicit traversal that touches every row.
- Primary operation: Scanning and combining rows from both existing and late-arriving data tables.
- How many times: Each row in both tables is read once during the union operation.
As the amount of late-arriving data grows, the total rows to process increase.
| Rows per Table (n) | Approx. Rows Processed |
|---|---|
| 10 | 10 (existing) + 10 (late) = 20 |
| 100 | 100 + 100 = 200 |
| 1000 | 1000 + 1000 = 2000 |
Pattern observation: The work grows roughly in direct proportion to the total rows combined.
Time Complexity: O(n)
This means the time to handle late-arriving data grows linearly with the total number of rows processed; the space needed to hold the merged result grows linearly as well.
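The pattern in the table can be sanity-checked with a small pure-Python stand-in for the union. The helper name `rows_processed` is hypothetical; list concatenation plays the role of `union all`, since both read every input row exactly once.

```python
def rows_processed(n_existing: int, n_late: int) -> int:
    """Count rows touched by a union-all-style merge of two inputs."""
    existing = list(range(n_existing))
    late = list(range(n_late))
    merged = existing + late  # one pass over each input, like UNION ALL
    return len(merged)

for n in (10, 100, 1000):
    print(n, rows_processed(n, n))  # grows linearly: 20, 200, 2000
```

Doubling the late-arriving rows doubles the count, which is exactly the O(n) behavior identified above.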
[X] Wrong: "Handling late-arriving data only adds a fixed small cost regardless of data size."
[OK] Correct: Because the system must read and combine all late data rows, the cost grows with how much late data arrives.
Understanding how data volume affects processing time helps you design efficient pipelines and explain your choices clearly.
"What if we indexed the event_date column to speed up filtering? How would the time complexity change?"
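One way to explore that question is to compare a full scan with a binary search over sorted values, a rough stand-in for what a B-tree index on `event_date` provides. Everything here is illustrative: `event_dates` is a hypothetical sorted column, and integer day numbers stand in for dates.

```python
import bisect

# Hypothetical sorted event_date column (what an index maintains).
event_dates = list(range(1000))  # one "day number" per row

def filter_full_scan(cutoff: int) -> int:
    # No index: every row is inspected -> O(n)
    return sum(1 for d in event_dates if d >= cutoff)

def filter_with_index(cutoff: int) -> int:
    # Index: binary search locates the window boundary in O(log n),
    # then only the k matching rows are read -> O(log n + k)
    start = bisect.bisect_left(event_dates, cutoff)
    return len(event_dates) - start

assert filter_full_scan(993) == filter_with_index(993) == 7
```

Note the limit of this optimization: the index speeds up the `where event_date >= ...` filter, but the `union all` itself still reads every row it combines, so the overall merge remains O(n) in the total rows processed.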