dbt · data · ~20 mins

Full refresh vs incremental in dbt - Practice Questions

Challenge - 5 Problems
🧠 Conceptual · intermediate
Understanding Full Refresh in dbt

Which statement best describes what happens during a full refresh in dbt?

A. dbt updates only the changed rows in the existing table using merge operations.
B. dbt deletes the existing table and rebuilds it completely from the source data.
C. dbt only adds new rows to the existing table without modifying existing data.
D. dbt skips the model and uses cached results from the previous run.
💡 Hint

Think about what happens when you want to start fresh with your data model.
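A minimal pandas sketch of full-refresh behavior (hypothetical table contents, not dbt internals):

```python
import pandas as pd

# Illustrative data: the existing target holds a stale row.
existing_target = pd.DataFrame({'id': [1], 'value': ['stale']})
source = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']})

# Full refresh: the prior target is discarded and the table is
# rebuilt entirely from the current source data.
target = source.copy()
```

In dbt, this is what `dbt run --full-refresh` forces for an incremental model: the existing relation is dropped and recreated from scratch.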

🧠 Conceptual · intermediate
Incremental Model Behavior

What is the main advantage of using an incremental model in dbt?

A. It processes only new or changed data, reducing runtime and resource use.
B. It rebuilds the entire table every time to ensure data accuracy.
C. It automatically archives old data to a separate table.
D. It disables all data validations during the run.
💡 Hint

Consider how to save time when working with large datasets that update frequently.
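A pandas sketch of incremental loading with a high-water mark (illustrative data, not dbt's actual execution):

```python
import pandas as pd

# Existing target and the current source snapshot (illustrative).
target = pd.DataFrame({'id': [1, 2],
                       'updated_at': ['2024-01-01', '2024-01-02']})
source = pd.DataFrame({'id': [1, 2, 3],
                       'updated_at': ['2024-01-01', '2024-01-02', '2024-01-03']})

# Only rows newer than the target's max timestamp are processed,
# so unchanged history is never re-read.
watermark = target['updated_at'].max()
new_rows = source[source['updated_at'] > watermark]
target = pd.concat([target, new_rows], ignore_index=True)
```

Only one of the three source rows is touched on this run; on a table with millions of rows that is where the runtime and cost savings come from.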

Data Output · advanced
Output of Incremental Model Run

Given this incremental model SQL snippet, what will be the content of the target table after running the model twice?

-- model.sql
{{ config(materialized='incremental', unique_key='id') }}

select id, value, updated_at
from source_table
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

Assume source_table initially has rows with ids 1 and 2, then a new row with id 3 is added before the second run.

python
import pandas as pd

# Initial source_table data
source_table_1 = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b'], 'updated_at': ['2024-01-01', '2024-01-02']})

# After first run, target table content
first_run = source_table_1[['id', 'value']]

# New data added
source_table_2 = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c'], 'updated_at': ['2024-01-01', '2024-01-02', '2024-01-03']})

# After second run, incremental adds only new row
second_run = pd.concat([first_run, source_table_2[source_table_2['id'] == 3][['id', 'value']]], ignore_index=True)

second_run
A. [{'id': 1, 'value': 'a'}, {'id': 2, 'value': 'b'}, {'id': 3, 'value': 'c'}]
B. [{'id': 3, 'value': 'c'}]
C. [{'id': 1, 'value': 'a'}, {'id': 2, 'value': 'b'}]
D. []
💡 Hint

Think about how incremental models add new rows without deleting existing ones.
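A pandas sketch of what `unique_key` adds on top of plain appending (illustrative data): a batch row whose key already exists replaces the old row, while genuinely new keys are appended.

```python
import pandas as pd

target = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']})
batch = pd.DataFrame({'id': [2, 3], 'value': ['b2', 'c']})

# Sketch of merge / delete+insert semantics keyed on id:
# keep='last' lets the incoming batch row win over the existing one.
merged = (pd.concat([target, batch])
            .drop_duplicates(subset='id', keep='last')
            .sort_values('id')
            .reset_index(drop=True))
```

Without a `unique_key`, dbt simply appends the batch; with one, existing rows for matching keys are updated instead of duplicated.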

🔧 Debug · advanced
Error in Incremental Model Unique Key

Consider this dbt incremental model configuration:

{{ config(materialized='incremental', unique_key='user_id') }}

The source data has duplicate user_id values in the new data batch. What error or issue will most likely occur when running this model?

A. dbt will silently drop duplicate rows without warning.
B. dbt will fail with a syntax error due to duplicate keys.
C. dbt will raise a unique key violation error during the merge step.
D. dbt will rebuild the entire table ignoring incremental logic.
💡 Hint

Think about what happens if the unique key is not unique in the incremental data.
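A quick pandas check for the failure condition (hypothetical batch data): when the configured `unique_key` repeats within one batch, a warehouse `MERGE` would have to match a single target row against two source rows, which is ambiguous, and the run fails.

```python
import pandas as pd

# Hypothetical batch where user_id (the configured unique_key) repeats.
batch = pd.DataFrame({'user_id': [7, 7], 'plan': ['free', 'pro']})

# Detecting the problem before the merge: any duplicated key in the
# incremental batch means the merge cannot resolve a single winner.
has_duplicate_keys = bool(batch['user_id'].duplicated().any())
```

The usual fix is to deduplicate upstream, e.g. with a `row_number()` window in the model's SQL, so the batch carries at most one row per key.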

🚀 Application · expert
Choosing Between Full Refresh and Incremental

You manage a dbt model that processes millions of rows daily. The source data sometimes has late-arriving updates for past dates. Which approach best balances performance and data accuracy?

A. Disable incremental processing and rely on snapshots alone to track changes.
B. Always use full refresh to ensure all data is accurate, despite longer runtimes.
C. Use incremental models with no reprocessing window and rely on source data correctness.
D. Use incremental models with a lookback window of a few days to reprocess recent data, plus a weekly full refresh.
💡 Hint

Consider how to handle late data updates while keeping runtimes reasonable.
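A pandas sketch of the lookback-window pattern (illustrative data): rows inside the window are recomputed from source, so late-arriving updates are picked up, while older rows are left untouched.

```python
import pandas as pd

target = pd.DataFrame({'id': [1, 2, 3],
                       'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
                       'value': ['a', 'b', 'c']})
# The source later corrects the row for 2024-01-02 and adds a new day.
source = pd.DataFrame({'id': [1, 2, 3, 4],
                       'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04']),
                       'value': ['a', 'b_late_fix', 'c', 'd']})

# Reprocess everything within a 2-day lookback window plus new data.
cutoff = target['date'].max() - pd.Timedelta(days=2)
kept = target[target['date'] < cutoff]            # outside the window: untouched
reprocessed = source[source['date'] >= cutoff]    # inside the window: rebuilt
target = pd.concat([kept, reprocessed]).sort_values('id').reset_index(drop=True)
```

The window bounds how much data each run touches, so runtimes stay close to pure incremental while late corrections still land; the occasional full refresh catches anything older than the window.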