Consider a dbt test that checks for null values in a column user_id. The following SQL stores failures in a table. What rows will be stored?
INSERT INTO test_failures (test_name, failed_row_id)
SELECT 'not_null_user_id', id
FROM users
WHERE user_id IS NULL;

Think about the WHERE clause filtering rows with user_id IS NULL.
The query inserts rows where user_id is NULL, labeling them with the test name. This stores only failing rows.
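This behavior can be sketched with an in-memory SQLite database (the table names follow the question; the sample rows are invented for illustration):

```python
import sqlite3

# In-memory database with the tables from the question (sample data invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, user_id INTEGER)")
conn.execute("CREATE TABLE test_failures (test_name TEXT, failed_row_id INTEGER)")

# Two rows pass the test; row 2 fails because its user_id is NULL.
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, 101), (2, None), (3, 103)])

# The test query: only rows where user_id IS NULL are stored.
conn.execute("""
    INSERT INTO test_failures (test_name, failed_row_id)
    SELECT 'not_null_user_id', id FROM users WHERE user_id IS NULL
""")

failures = conn.execute("SELECT * FROM test_failures").fetchall()
print(failures)  # [('not_null_user_id', 2)]
```

Only the single failing row lands in test_failures, tagged with the test name.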
Given the table orders with 1000 rows, 50 have order_date NULL. The test stores failures with:
INSERT INTO failure_log (test_name, order_id) SELECT 'not_null_order_date', order_id FROM orders WHERE order_date IS NULL;
How many rows will failure_log have after this?
Count rows where order_date IS NULL.
The query inserts one row per order with NULL order_date, so 50 rows are stored.
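A quick sketch confirms the count, again using an in-memory SQLite database with invented sample data matching the question's description (1000 orders, 50 with a NULL order_date):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
conn.execute("CREATE TABLE failure_log (test_name TEXT, order_id INTEGER)")

# 1000 orders; the first 50 have a NULL order_date.
rows = [(i, None if i <= 50 else "2024-01-01") for i in range(1, 1001)]
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

conn.execute("""
    INSERT INTO failure_log (test_name, order_id)
    SELECT 'not_null_order_date', order_id FROM orders
    WHERE order_date IS NULL
""")

count = conn.execute("SELECT COUNT(*) FROM failure_log").fetchone()[0]
print(count)  # 50
```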
Look at this SQL snippet intended to store failures where email is NULL:
INSERT INTO test_failures (test_name, failed_row_id) SELECT 'not_null_email', user_id FROM users WHERE email = NULL;
Why does it not store any rows?
Remember how SQL compares NULL values.
In SQL, email = NULL evaluates to NULL (unknown), never TRUE, so the WHERE clause filters out every row. The correct check is email IS NULL, which matches rows where email is NULL.
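The difference between the two comparisons can be demonstrated directly (in-memory SQLite, invented sample rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@example.com"), (2, None)])

# email = NULL evaluates to NULL (not TRUE), so no rows match.
eq_rows = conn.execute(
    "SELECT user_id FROM users WHERE email = NULL").fetchall()

# email IS NULL correctly matches the row with a NULL email.
is_rows = conn.execute(
    "SELECT user_id FROM users WHERE email IS NULL").fetchall()

print(eq_rows)  # []
print(is_rows)  # [(2,)]
```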
You want to store rows where customer_id is duplicated in sales. Which query does this correctly?
Think about how to find duplicates using GROUP BY and HAVING.
The correct query groups by customer_id and keeps groups where COUNT(*) > 1, which identifies exactly the duplicated customer_id values.
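The GROUP BY / HAVING approach can be sketched as follows (in-memory SQLite; the sales rows are invented, with customer 10 appearing three times):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 10), (4, 30), (5, 10)])

# Group by customer_id and keep only groups that occur more than once.
dupes = conn.execute("""
    SELECT customer_id, COUNT(*) AS n
    FROM sales
    GROUP BY customer_id
    HAVING COUNT(*) > 1
""").fetchall()

print(dupes)  # [(10, 3)]
```

Note that HAVING (not WHERE) is required here, because the filter applies to the aggregated groups rather than to individual rows.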
Why do data teams store test failures in dedicated tables instead of just logging errors?
Think about how stored failures help in understanding data quality issues.
Storing failures in dedicated tables lets teams query, analyze, and track problematic rows over time, which aids debugging and data-quality improvement in a way that transient log messages cannot.