Which statement best describes what happens during a full refresh in dbt?
Think about what happens when you want to start fresh with your data model.
A full refresh means dbt drops the existing table and rebuilds it entirely from the source data. This ensures the table is fully up to date but can be slower for large datasets.
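The drop-and-rebuild semantics can be sketched as a toy pandas simulation (the names `source_table` and `full_refresh` are illustrative, not dbt internals):

```python
import pandas as pd

# Hypothetical source data for illustration
source_table = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})

def full_refresh(source: pd.DataFrame) -> pd.DataFrame:
    """Simulate a full refresh: the old target is discarded, never consulted,
    and the table is rebuilt entirely from the source."""
    return source.copy()

target = full_refresh(source_table)
print(len(target))  # 3 rows, all rebuilt from scratch
```

In actual dbt usage this behavior is triggered with the `--full-refresh` flag on `dbt run`.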
What is the main advantage of using an incremental model in dbt?
Consider how to save time when working with large datasets that update frequently.
Incremental models update only new or changed data, which makes runs faster and uses fewer resources compared to rebuilding the entire table.
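The upsert behavior behind an incremental run can be sketched in pandas (a simplified simulation; `incremental_merge` is an illustrative helper, not a dbt function):

```python
import pandas as pd

# Existing target table and a new batch of changed/new rows
target = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']})
new_batch = pd.DataFrame({'id': [2, 3], 'value': ['b2', 'c']})

def incremental_merge(target: pd.DataFrame, batch: pd.DataFrame,
                      unique_key: str = 'id') -> pd.DataFrame:
    """Upsert: batch rows replace matching target rows; new keys are appended.
    Untouched rows (id 1) are never reprocessed."""
    kept = target[~target[unique_key].isin(batch[unique_key])]
    return pd.concat([kept, batch], ignore_index=True)

result = incremental_merge(target, new_batch)
# result holds id 1 unchanged, id 2 updated to 'b2', id 3 newly inserted
```

Only the two batch rows are processed, regardless of how large the target table has grown.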
Given this incremental model SQL snippet, what will be the content of the target table after running the model twice?
-- model.sql
{{ config(materialized='incremental', unique_key='id') }}
select id, value
from source_table
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

Assume source_table initially has rows with ids 1 and 2, then a new row with id 3 is added before the second run.
import pandas as pd

# Initial source_table data
source_table_1 = pd.DataFrame({
    'id': [1, 2],
    'value': ['a', 'b'],
    'updated_at': ['2024-01-01', '2024-01-02'],
})

# After the first run, the target table holds the initial rows
first_run = source_table_1[['id', 'value']]

# New data added before the second run
source_table_2 = pd.DataFrame({
    'id': [1, 2, 3],
    'value': ['a', 'b', 'c'],
    'updated_at': ['2024-01-01', '2024-01-02', '2024-01-03'],
})

# After the second run, the incremental model adds only the new row
second_run = pd.concat(
    [first_run, source_table_2[source_table_2['id'] == 3][['id', 'value']]],
    ignore_index=True,
)
second_run
Think about how incremental models add new rows without deleting existing ones.
After the first run, the table has rows with ids 1 and 2. The second run adds only the new row with id 3, so the table contains all three rows.
Consider this dbt incremental model configuration:
{{ config(materialized='incremental', unique_key='user_id') }}

The source data has duplicate user_id values in the new data batch. What error or issue will most likely occur when running this model?
Think about what happens if the unique key is not unique in the incremental data.
If the unique key column has duplicates in the new data batch, the merge cannot uniquely match each target row, so most warehouses reject the statement as a nondeterministic merge; warehouses that do not error may silently produce duplicate or inconsistent rows.
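The ambiguity can be illustrated with a toy pandas join (an analogy for the warehouse-side merge, not dbt's actual implementation):

```python
import pandas as pd

target = pd.DataFrame({'user_id': [1], 'value': ['old']})
# New batch with duplicate user_id values -- which row should win?
batch = pd.DataFrame({'user_id': [1, 1], 'value': ['x', 'y']})

# Joining on user_id matches the single target row twice, so an update
# keyed on user_id is ambiguous; pandas surfaces this as row multiplication
joined = target.merge(batch, on='user_id', suffixes=('_old', '_new'))
print(len(joined))  # 2 candidate updates for one target row
```

A SQL MERGE faces the same two candidate source rows for one target row, which is exactly the condition warehouses flag as nondeterministic.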
You manage a dbt model that processes millions of rows daily. The source data sometimes has late-arriving updates for past dates. Which approach best balances performance and data accuracy?
Consider how to handle late data updates while keeping runtimes reasonable.
Incremental models with a reprocessing window allow updating recent data that might have changed, while running a full refresh less frequently ensures overall accuracy without incurring excessive runtime every day.
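The reprocessing (lookback) window amounts to filtering a few days behind the target's high-water mark instead of strictly after it. A minimal sketch, assuming an illustrative 3-day lookback:

```python
import pandas as pd

LOOKBACK_DAYS = 3  # assumed example value; tune to how late data arrives

source = pd.DataFrame({
    'id': [1, 2, 3],
    'updated_at': pd.to_datetime(['2024-01-01', '2024-01-08', '2024-01-09']),
})
target_max = pd.Timestamp('2024-01-09')  # max(updated_at) already in the target

# Step back a few days from the high-water mark so late-arriving
# updates to recent dates are reprocessed; older rows are skipped
cutoff = target_max - pd.Timedelta(days=LOOKBACK_DAYS)
to_process = source[source['updated_at'] > cutoff]
print(to_process['id'].tolist())  # [2, 3]
```

Combined with the incremental model's unique_key, the reprocessed rows overwrite their earlier versions rather than duplicating them.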