Handling late-arriving data in dbt - Time & Space Complexity
When working with data pipelines, late-arriving data can affect how long processing takes.
We want to know how the time to handle that data grows as more late rows arrive.
Analyze the time complexity of the following dbt code snippet.
```sql
-- Model to merge late-arriving data
with new_data as (
    select *
    from source_table
    where event_date >= dateadd(day, -7, current_date)
),
merged as (
    select * from target_table
    union all
    select * from new_data
)
select *
from merged
where event_date <= current_date
```
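To make the model's logic concrete, here is a self-contained sketch that runs the same query against Python's built-in `sqlite3` (standing in for the warehouse). The table names and columns mirror the snippet above; the fixed date and the sample row counts are assumptions made so the example is deterministic, not part of the original model.

```python
import sqlite3
from datetime import date, timedelta

# sqlite3 stands in for the warehouse; this is a sketch of the model's
# logic, not the actual dbt execution path.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table source_table (event_date text, value integer)")
cur.execute("create table target_table (event_date text, value integer)")

today = date(2024, 1, 15)  # fixed "current_date" for a deterministic example

# Existing rows in the target, well outside the 7-day window
cur.executemany(
    "insert into target_table values (?, ?)",
    [((today - timedelta(days=30 + i)).isoformat(), i) for i in range(5)],
)
# Late-arriving rows inside the last 7 days
cur.executemany(
    "insert into source_table values (?, ?)",
    [((today - timedelta(days=i)).isoformat(), i) for i in range(1, 4)],
)

cutoff = (today - timedelta(days=7)).isoformat()
rows = cur.execute(
    """
    with new_data as (
        select * from source_table where event_date >= ?
    ),
    merged as (
        select * from target_table
        union all
        select * from new_data
    )
    select * from merged where event_date <= ?
    """,
    (cutoff, today.isoformat()),
).fetchall()

print(len(rows))  # 5 existing + 3 late = 8 rows
```

ISO-formatted date strings compare correctly as text, which is why the `>=`/`<=` filters work in SQLite without a date type.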
This code merges recent late-arriving data with existing data for processing.
Identify the repeated work. SQL has no explicit loops or recursion, but each table scan is an implicit traversal that touches every row.
- Primary operation: Scanning and combining rows from both existing and late-arriving data tables.
- How many times: Each row in both tables is read once during the union operation.
As the amount of late-arriving data grows, the total rows to process increase.
| Rows per Table (n) | Approx. Rows Processed |
|---|---|
| 10 | 10 (existing) + 10 (late) = 20 |
| 100 | 100 + 100 = 200 |
| 1000 | 1000 + 1000 = 2000 |
Pattern observation: The work grows roughly in direct proportion to the total rows combined.
Time Complexity: O(n)
This means the time to handle late-arriving data grows linearly with the total number of rows processed; the space needed to hold the merged result grows linearly as well.
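The pattern in the table can be sanity-checked with a small pure-Python stand-in for the union. The helper name `rows_processed` is hypothetical; list concatenation plays the role of `union all`, since both read every input row exactly once.

```python
def rows_processed(n_existing: int, n_late: int) -> int:
    """Count rows touched by a union-all-style merge of two inputs."""
    existing = list(range(n_existing))
    late = list(range(n_late))
    merged = existing + late  # one pass over each input, like UNION ALL
    return len(merged)

for n in (10, 100, 1000):
    print(n, rows_processed(n, n))  # grows linearly: 20, 200, 2000
```

Doubling the late-arriving rows doubles the count, which is exactly the O(n) behavior identified above.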
[X] Wrong: "Handling late-arriving data only adds a fixed small cost regardless of data size."
[OK] Correct: Because the system must read and combine all late data rows, the cost grows with how much late data arrives.
Understanding how data volume affects processing time helps you design efficient pipelines and explain your choices clearly.
"What if we indexed the event_date column to speed up filtering? How would the time complexity change?"
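One way to explore that question is to compare a full scan with a binary search over sorted values, a rough stand-in for what a B-tree index on `event_date` provides. Everything here is illustrative: `event_dates` is a hypothetical sorted column, and integer day numbers stand in for dates.

```python
import bisect

# Hypothetical sorted event_date column (what an index maintains).
event_dates = list(range(1000))  # one "day number" per row

def filter_full_scan(cutoff: int) -> int:
    # No index: every row is inspected -> O(n)
    return sum(1 for d in event_dates if d >= cutoff)

def filter_with_index(cutoff: int) -> int:
    # Index: binary search locates the window boundary in O(log n),
    # then only the k matching rows are read -> O(log n + k)
    start = bisect.bisect_left(event_dates, cutoff)
    return len(event_dates) - start

assert filter_full_scan(993) == filter_with_index(993) == 7
```

Note the limit of this optimization: the index speeds up the `where event_date >= ...` filter, but the `union all` itself still reads every row it combines, so the overall merge remains O(n) in the total rows processed.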