Why advanced testing catches subtle data issues in dbt - Performance Analysis
We want to understand how the runtime of advanced dbt tests grows as data gets bigger: how does adding more data affect the time it takes to find subtle data problems?
Analyze the time complexity of the following dbt test code.
```sql
-- Advanced test to find subtle data issues
select
    user_id,
    count(*) as event_count
from {{ ref('events') }}
where event_type = 'purchase'
group by user_id
having count(*) < 5
```
This test flags users with fewer than five purchase events, surfacing rare or unusual activity patterns.
Look at what repeats as data grows.
- Primary operation: Scanning all event rows to filter and group by user_id.
- How many times: Once over all events, then grouping by each user.
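The scan-and-group step can be sketched in plain Python. This is a hypothetical model of what the warehouse does, not dbt's actual execution; the function name and tuple layout are assumptions for illustration:

```python
def low_purchase_users(events, threshold=5):
    """events: iterable of (user_id, event_type) tuples.

    Models the test's SQL: one scan over all n rows (the part that
    repeats as data grows), a WHERE-style filter, a GROUP BY-style
    count per user, then a HAVING-style filter on the counts.
    """
    counts = {}
    for user_id, event_type in events:    # one pass over all n rows
        if event_type == "purchase":      # where event_type = 'purchase'
            counts[user_id] = counts.get(user_id, 0) + 1
    # having count(*) < threshold
    return {u: c for u, c in counts.items() if c < threshold}

events = [("a", "purchase"), ("a", "view"), ("b", "purchase"),
          ("b", "purchase"), ("a", "purchase")]
print(low_purchase_users(events))  # {'a': 2, 'b': 2}
```

The single `for` loop is the operation that repeats: doubling the event rows doubles the loop iterations.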
As the number of events grows, the test scans more rows and groups more users.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 rows scanned and grouped |
| 100 | About 100 rows scanned and grouped |
| 1000 | About 1000 rows scanned and grouped |
Pattern observation: Operations grow roughly in direct proportion to data size.
Time Complexity: O(n)
This means the time to run the test grows linearly with the number of events (assuming hash-based aggregation; a sort-based GROUP BY would be O(n log n)).
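The test's SQL can be sanity-checked end to end with sqlite3 standing in for the warehouse. This is a minimal sketch: a plain `events` table replaces dbt's `{{ ref('events') }}`, and the row values are invented for the demo. The engine still has to touch every row once, which is where the linear growth comes from:

```python
import sqlite3

# In-memory stand-in for the events model: one heavy user and one
# rare user, plus a non-purchase row the WHERE clause should drop.
con = sqlite3.connect(":memory:")
con.execute("create table events (user_id text, event_type text)")
rows = ([("bulk", "purchase")] * 20
        + [("rare", "purchase")] * 2
        + [("rare", "view")])
con.executemany("insert into events values (?, ?)", rows)

# The test's SQL, with ref('events') replaced by the local table.
result = con.execute("""
    select user_id, count(*) as event_count
    from events
    where event_type = 'purchase'
    group by user_id
    having count(*) < 5
""").fetchall()
print(result)  # [('rare', 2)]
```

Only `rare` is returned: `bulk` has 20 purchases and fails the HAVING filter, and the `view` row never reaches the grouping step.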
[X] Wrong: "Advanced tests only add a small fixed time, no matter data size."
[OK] Correct: These tests scan and group all data, so more data means more work and longer time.
Understanding how test time grows helps you keep data quality checks efficient as your data grows.
"What if we added a filter that only checks recent events? How would the time complexity change?"