Creating your first model in dbt - Performance & Efficiency
Start learning this pattern below
Jump into concepts and practice - no test required
When creating your first model in dbt, it is helpful to understand how the time to run your model grows as your data grows.
We want to know how the work dbt does changes when the input data size changes.
Analyze the time complexity of the following dbt model SQL code.
select
customer_id,
count(order_id) as total_orders
from {{ ref('raw_orders') }}
group by customer_id
This code creates a model that counts orders per customer from a raw orders table.
Identify the loops, recursion, array traversals that repeat.
- Primary operation: Scanning all rows in the raw_orders table.
- How many times: Once over all rows to group and count orders per customer.
As the number of orders grows, the time to scan and group them grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 rows scanned and grouped |
| 100 | About 100 rows scanned and grouped |
| 1000 | About 1000 rows scanned and grouped |
Pattern observation: The work grows steadily as the number of rows grows.
Time Complexity: O(n)
This means the time to run the model grows roughly in direct proportion to the number of rows in the input table.
[X] Wrong: "Grouping data is instant and does not depend on input size."
[OK] Correct: Grouping requires looking at each row to count, so it takes more time as data grows.
Understanding how data size affects model run time helps you write efficient dbt models and explain your reasoning clearly.
"What if we added a filter to only include recent orders? How would that change the time complexity?"
Practice
Solution
Step 1: Understand the role of dbt models
dbt models are SQL files that transform raw data into clean tables for analysis.Step 2: Compare options with this role
Only To transform raw data into clean, usable tables describes transforming raw data into clean tables, which matches the purpose of dbt models.Final Answer:
To transform raw data into clean, usable tables -> Option DQuick Check:
dbt model purpose = transform raw data [OK]
- Confusing models with dashboards
- Thinking models store raw data unchanged
- Assuming models are Python scripts
Solution
Step 1: Recall dbt model syntax
A dbt model is a SQL SELECT statement saved as a .sql file in the models folder.Step 2: Evaluate each option
SELECT * FROM raw_data is a simple SELECT statement, which is the correct way to define a model. Options B, C, and D use incorrect syntax or commands not used in dbt model files.Final Answer:
SELECT * FROM raw_data -> Option AQuick Check:
dbt model = simple SELECT statement [OK]
- Using CREATE MODEL syntax (not valid in dbt)
- Trying to run dbt commands inside SQL files
- Using INSERT statements instead of SELECT
models/my_first_model.sql:
SELECT id, name FROM raw_customers WHERE active = trueWhat will be the output when you run
dbt run?Solution
Step 1: Understand what dbt run does
Runningdbt runexecutes model SQL files and creates tables or views with the model name.Step 2: Analyze the model SQL
The model selects id and name from raw_customers where active is true, so the output table will contain only active customers.Final Answer:
A new table or view named my_first_model with active customers only -> Option CQuick Check:
dbt run creates model tables = filtered active customers [OK]
- Expecting CREATE TABLE in model SQL
- Thinking dbt deletes source tables
- Believing dbt run does nothing
models/customer_summary.sql:
SELECT customer_id, order_id, COUNT(*) AS orders_count FROM orders GROUP BY customer_idWhen you run
dbt run, you get an error. What is the most likely cause?Solution
Step 1: Recall GROUP BY rules
When using GROUP BY, all non-aggregated columns in SELECT must either be aggregated or included in GROUP BY.Step 2: Analyze the SELECT and GROUP BY columns
SELECT has customer_id (in GROUP BY), order_id (neither aggregated nor grouped), COUNT(*) (aggregated). Thus, GROUP BY does not match SELECT columns.Final Answer:
The GROUP BY column does not match the SELECT columns -> Option BQuick Check:
GROUP BY must include all non-aggregated SELECT columns [OK]
- Forgetting to include non-aggregated columns in GROUP BY
- Assuming semicolon is required
- Saving model outside models folder
Solution
Step 1: Understand filtering on aggregated values
To filter groups by aggregated values, use HAVING with the aggregate function, not WHERE.Step 2: Analyze each option
SELECT category, SUM(sales) AS total_sales FROM sales_data GROUP BY category HAVING total_sales > 1000 uses HAVING total_sales > 1000, but total_sales is an alias and cannot be used directly in HAVING in many SQL dialects. SELECT category, SUM(sales) AS total_sales FROM sales_data GROUP BY category HAVING SUM(sales) > 1000 uses HAVING SUM(sales) > 1000, which is correct. Options B and D incorrectly use WHERE with aggregate functions, which is invalid.Final Answer:
SELECT category, SUM(sales) AS total_sales FROM sales_data GROUP BY category HAVING SUM(sales) > 1000 -> Option AQuick Check:
Use HAVING with aggregate functions to filter groups [OK]
- Using WHERE to filter aggregated results
- Using alias names in HAVING clause
- Forgetting GROUP BY when aggregating
