Slowly changing dimensions (SCD Type 2) in dbt - Time & Space Complexity
When working with slowly changing dimensions (SCD Type 2) in dbt, we want to know how processing time grows as the data size increases.
We ask: how does the time to detect changes, update records, and insert new records change as the input data grows?
Analyze the time complexity of the following dbt SQL snippet for SCD Type 2.
```sql
WITH source_data AS (
    SELECT * FROM {{ ref('source_table') }}
),

current_dim AS (
    SELECT * FROM {{ ref('dimension_table') }} WHERE current_flag = 1
),

-- Keys that already exist but whose tracked attribute has changed
updates AS (
    SELECT s.*
    FROM source_data s
    JOIN current_dim d ON s.key = d.key
    WHERE s.attribute <> d.attribute
),

-- Keys that do not yet exist in the dimension
new_records AS (
    SELECT s.*
    FROM source_data s
    LEFT JOIN current_dim d ON s.key = d.key
    WHERE d.key IS NULL
)

SELECT * FROM updates
UNION ALL
SELECT * FROM new_records
```
This code finds changed and new records to update the dimension table with history.
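The same change-detection logic can be sketched in plain Python. This is a minimal illustration, assuming the warehouse executes the joins as hash joins; the field names (`key`, `attribute`) mirror the dbt model above.

```python
def scd2_changes(source_rows, current_dim_rows):
    """Split source rows into 'updates' (changed) and 'new_records' (unseen keys)."""
    # Build a hash map of the current dimension, keyed on the business key.
    dim_by_key = {row["key"]: row for row in current_dim_rows}

    updates, new_records = [], []
    for row in source_rows:                    # one pass over the source
        existing = dim_by_key.get(row["key"])  # O(1) average-case lookup
        if existing is None:
            new_records.append(row)            # key never seen -> insert
        elif existing["attribute"] != row["attribute"]:
            updates.append(row)                # key exists but changed -> new version
    return updates, new_records

source = [
    {"key": 1, "attribute": "a"},  # unchanged
    {"key": 2, "attribute": "x"},  # changed
    {"key": 3, "attribute": "c"},  # brand-new key
]
dim = [
    {"key": 1, "attribute": "a"},
    {"key": 2, "attribute": "b"},
]
updates, new_records = scd2_changes(source, dim)
print(updates)       # [{'key': 2, 'attribute': 'x'}]
print(new_records)   # [{'key': 3, 'attribute': 'c'}]
```

Each source row triggers exactly one hash lookup, which is why the comparison count in the table below tracks the input size.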
Identify the operations that repeat: loops, recursion, or per-row traversals.
- Primary operation: joining source data with the current dimension rows to find changed and new records.
- How many times: assuming a hash join, each record in the source data is hashed and compared against the dimension data once.
As the number of source records grows, the number of comparisons grows roughly in proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 comparisons |
| 100 | About 100 comparisons |
| 1000 | About 1000 comparisons |
Pattern observation: The work grows linearly as input size increases.
Time Complexity: O(n)
This means the processing time grows in direct proportion to the number of input records. (Strictly, building the hash table also scans the m current dimension rows, so the full cost is O(n + m), but that is still linear in the data size.)
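A quick way to confirm the linear pattern is to count the lookups directly. This small experiment, assuming the hash-join model described above, reproduces the operation counts in the table:

```python
def count_lookups(n):
    """Count key lookups for n source records against a toy dimension table."""
    # A stand-in dimension table covering half the keys.
    dim_by_key = {k: {"key": k, "attribute": "old"} for k in range(n // 2)}
    lookups = 0
    for k in range(n):          # exactly one hash lookup per source record
        _ = dim_by_key.get(k)
        lookups += 1
    return lookups

for n in (10, 100, 1000):
    print(n, count_lookups(n))  # lookups equal n: the work grows linearly
```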
[X] Wrong: "The process will take the same time no matter how many records there are."
[OK] Correct: Each additional record adds work because it must be checked and possibly inserted or updated, so processing time grows with data size.
Understanding how data volume affects processing time helps you explain and improve data pipeline efficiency in real projects.
"What if we added an index on the key columns? How would the time complexity change?"