In data projects, simple tests check for obvious problems like missing values. Advanced tests go deeper. Why do advanced tests catch subtle data issues better?
Think about what subtle data issues might look like beyond missing or null values.
Advanced tests look for complex data patterns, inconsistencies, or unexpected changes that simple tests miss. This helps catch subtle issues early.
Given this dbt test SQL snippet, what is the output if the 'user_id' column has duplicates?
select user_id, count(*) as count from {{ ref('users') }} group by user_id having count(*) > 1
Look at the GROUP BY and HAVING clause to understand what rows are returned.
The query groups by user_id and returns only the user_ids whose count is greater than 1, so each duplicated user_id is listed along with how many times it appears.
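The duplicate-detection query can be sketched against a tiny in-memory table. This uses SQLite via Python's sqlite3 module as a stand-in for the dbt warehouse, and the user_id values are hypothetical sample data.

```python
import sqlite3

# Build a toy 'users' table with duplicated user_ids (sample data).
conn = sqlite3.connect(":memory:")
conn.execute("create table users (user_id integer)")
conn.executemany("insert into users values (?)", [(1,), (2,), (2,), (3,), (3,), (3,)])

# Same shape as the dbt test: group by user_id, keep groups with count > 1.
rows = conn.execute(
    "select user_id, count(*) as count from users "
    "group by user_id having count(*) > 1"
).fetchall()

# Each returned row is a duplicated user_id with its occurrence count;
# user_id 1 appears once, so it is not returned.
print(rows)
```

A passing dbt test is one where this query returns zero rows, so any output here means the test fails.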
Consider a dbt test that compares the average order amount this month to last month and flags if the difference is over 20%. What would the test output be if the average changed from 100 to 130?
with last_month as (select avg(order_amount) as avg_last from orders where order_date between '2024-04-01' and '2024-04-30'), this_month as (select avg(order_amount) as avg_this from orders where order_date between '2024-05-01' and '2024-05-31') select case when abs(avg_this - avg_last)/avg_last > 0.2 then 'fail' else 'pass' end as test_result from last_month, this_month
Calculate the percentage change between 100 and 130.
The change is (130-100)/100 = 0.3 or 30%, which is greater than 20%, so the test fails.
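The arithmetic behind the pass/fail decision can be checked directly. This is a minimal sketch of the CASE expression's logic, using the 100 and 130 averages from the question:

```python
# Averages from the scenario: last month 100, this month 130.
avg_last, avg_this = 100, 130

# Relative change, mirroring abs(avg_this - avg_last) / avg_last in the SQL.
change = abs(avg_this - avg_last) / avg_last

# Mirror the CASE WHEN ... > 0.2 THEN 'fail' ELSE 'pass' END expression.
result = "fail" if change > 0.2 else "pass"

print(change, result)  # 0.3 fail
```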
Find the error in this dbt test SQL that checks for nulls in the 'email' column:
select * from {{ ref('customers') }} where email = NULL
Remember how SQL treats NULL comparisons.
In SQL, 'email = NULL' evaluates to NULL (unknown) rather than true, so it never matches any row. To check for nulls, use 'email IS NULL'.
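The difference between the broken and the corrected predicate can be demonstrated on a toy table. This sketch uses SQLite via Python's sqlite3 module, with hypothetical customer rows:

```python
import sqlite3

# Toy 'customers' table: one real email and one NULL (sample data).
conn = sqlite3.connect(":memory:")
conn.execute("create table customers (email text)")
conn.executemany("insert into customers values (?)", [("a@x.com",), (None,)])

# 'email = NULL' evaluates to NULL (unknown), so it matches no rows.
wrong = conn.execute("select * from customers where email = NULL").fetchall()

# 'email IS NULL' correctly finds the row with the missing email.
right = conn.execute("select * from customers where email is null").fetchall()

print(len(wrong), len(right))  # 0 1
```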
You want to detect subtle data issues like unexpected shifts in customer demographics over time. Which advanced dbt test approach is best?
Think about how to detect subtle changes, not just obvious errors.
Comparing statistical distributions over time helps catch subtle shifts in data that simple tests miss.
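One way to compare distributions over time is to measure how far two category breakdowns have drifted apart. This sketch uses total variation distance with hypothetical age-bucket proportions and an assumed 10% threshold; in dbt this comparison would typically be expressed in SQL over two time windows:

```python
# Hypothetical share of customers per age bucket in two periods.
last = {"18-25": 0.40, "26-40": 0.45, "41+": 0.15}
this = {"18-25": 0.25, "26-40": 0.50, "41+": 0.25}

# Total variation distance: half the sum of absolute differences in shares.
tvd = 0.5 * sum(abs(this[k] - last[k]) for k in last)

# Flag a shift larger than an assumed 0.1 (10%) threshold.
result = "fail" if tvd > 0.1 else "pass"

print(round(tvd, 2), result)
```

Here the 18-25 bucket shrank sharply while 41+ grew, so the distance exceeds the threshold and the test flags a demographic shift even though no individual row is invalid.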