Bird
Raised Fist0
dbtdata~15 mins

Creating your first model in dbt - Mechanics & Internals

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Overview - Creating your first model
What is it?
Creating your first model in dbt means writing a SQL query that transforms raw data into a clean, organized table or view. This model is a building block for your data analysis and reporting. It helps you shape data so it is easier to understand and use. Models in dbt are reusable and can depend on each other to build complex data pipelines.
Why it matters
Without models, raw data stays messy and hard to analyze, making decisions slow and error-prone. Models let you automate data cleaning and transformation, saving time and reducing mistakes. They create a clear, trusted source of data for everyone in your team. This makes your data work faster, more reliable, and easier to maintain.
Where it fits
Before creating your first model, you should understand basic SQL and have dbt installed with a connection to your data warehouse. After learning to create models, you will explore testing, documentation, and building complex data pipelines with multiple models.
Mental Model
Core Idea
A dbt model is a SQL query saved as a file that creates a table or view in your data warehouse, transforming raw data into clean, usable data.
Think of it like...
Creating a dbt model is like writing a recipe that turns raw ingredients into a finished dish. The recipe (SQL query) tells the kitchen (data warehouse) how to prepare the meal (clean data) step by step.
┌───────────────┐
│ Raw Data Table│
└──────┬────────┘
       │
       ▼
┌─────────────────────┐
│ dbt Model SQL Query  │
│ (Transformation)    │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│ Clean Data Table or  │
│ View in Warehouse   │
└─────────────────────┘
Build-Up - 6 Steps
1
FoundationUnderstanding dbt Models Basics
🤔
Concept: Learn what a dbt model is and how it relates to SQL queries and tables.
A dbt model is simply a SQL file saved in your project under the 'models' folder. When you run dbt, it runs these SQL queries and creates tables or views in your data warehouse. Each model file corresponds to one table or view. Models help organize your data transformations in a clear, repeatable way.
Result
You understand that a dbt model is a SQL file that creates a table or view when run.
Knowing that models are just SQL files helps you see dbt as a tool that organizes and runs your SQL transformations automatically.
2
FoundationSetting Up Your dbt Project
🤔
Concept: Learn how to create a new dbt project and connect it to your data warehouse.
Use the command 'dbt init my_project' to create a new project folder. Then edit the 'profiles.yml' file to add your data warehouse connection details like host, user, password, and database. This setup lets dbt know where to run your models.
Result
You have a working dbt project connected to your data warehouse, ready to run models.
Understanding the connection setup is crucial because dbt needs to know where to send your SQL queries to create tables.
3
IntermediateWriting Your First Model SQL Query
🤔Before reading on: Do you think your first model should select all columns or only some? Commit to your answer.
Concept: Learn to write a simple SQL SELECT statement as your first model.
Create a file 'models/my_first_model.sql'. Write a SQL query selecting columns from an existing raw data table, for example: SELECT id, name, created_at FROM raw.customers. This query defines what data your model will produce.
Result
A SQL file that defines a model selecting specific columns from raw data.
Choosing only needed columns helps keep your data clean and efficient, avoiding unnecessary data processing.
4
IntermediateRunning Your Model with dbt
🤔Before reading on: What do you think happens when you run 'dbt run'? Commit to your answer.
Concept: Learn how to execute your model and see the results in your data warehouse.
Run the command 'dbt run' in your project folder. dbt compiles your SQL files and runs them against your data warehouse, creating tables or views. Check your warehouse to see the new table created by your model.
Result
Your model's table or view appears in the data warehouse with the transformed data.
Seeing your model's output in the warehouse connects your SQL code to real data, reinforcing how dbt automates transformations.
5
AdvancedUsing Model Configurations
🤔Before reading on: Do you think dbt models create tables or views by default? Commit to your answer.
Concept: Learn to customize how dbt builds your model using configurations like materializations.
In your model SQL file, add a config block at the top: {{ config(materialized='view') }} or {{ config(materialized='table') }}. This tells dbt whether to create a table (stored data) or a view (virtual table). Choosing materialization affects performance and storage.
Result
Your model builds as a table or view based on your configuration.
Knowing materializations lets you optimize your data pipeline for speed, storage, and freshness.
6
ExpertUnderstanding Model Dependencies and DAG
🤔Before reading on: Do you think dbt runs models in any order or follows dependencies? Commit to your answer.
Concept: Learn how dbt manages model dependencies and runs them in the correct order using a Directed Acyclic Graph (DAG).
When one model selects from another model, dbt detects this dependency. It builds a DAG to know which models depend on others. When you run 'dbt run', it runs models in order so dependencies are ready first. This prevents errors and ensures data consistency.
Result
Models run in the correct order, respecting dependencies automatically.
Understanding the DAG helps you design complex pipelines without manual ordering, making your workflows reliable and scalable.
Under the Hood
dbt reads your SQL model files and compiles them into raw SQL queries tailored for your data warehouse dialect. It analyzes the SQL to detect references to other models, building a dependency graph. When you run dbt, it executes these queries in dependency order, creating tables or views in the warehouse. dbt also manages schema changes and incremental updates if configured.
Why designed this way?
dbt was designed to bring software engineering best practices to data transformation. By treating SQL queries as code and managing dependencies automatically, it solves the problem of messy, unorganized data pipelines. Alternatives like manual SQL scripts or ETL tools lacked version control, testing, and modularity, which dbt provides.
┌───────────────┐
│ Model SQL File│
└──────┬────────┘
       │
       ▼
┌─────────────────────┐
│ dbt Compiler        │
│ - Parses SQL        │
│ - Detects refs      │
│ - Builds DAG        │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│ Data Warehouse      │
│ - Runs SQL Queries  │
│ - Creates Tables/Views│
└─────────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does dbt automatically create tables for models by default? Commit yes or no.
Common Belief:dbt always creates tables when you run a model.
Tap to reveal reality
Reality:By default, dbt creates views, not tables, unless you specify materialization as 'table'.
Why it matters:Assuming tables are created can lead to performance issues or confusion about data storage and refresh behavior.
Quick: Do you think dbt models can only select from raw tables, not other models? Commit your answer.
Common Belief:Models can only use raw tables as data sources, not other models.
Tap to reveal reality
Reality:Models can select from other models, creating a chain of transformations managed by dbt.
Why it matters:Not knowing this limits your ability to build modular, layered data pipelines.
Quick: Does running 'dbt run' execute models in the order they appear in the folder? Commit yes or no.
Common Belief:dbt runs models in alphabetical or folder order.
Tap to reveal reality
Reality:dbt runs models based on dependency order, ensuring upstream models run before downstream ones.
Why it matters:Misunderstanding this can cause errors if you rely on manual ordering.
Quick: Can you edit models directly in the data warehouse after dbt runs? Commit yes or no.
Common Belief:You can safely edit tables created by dbt directly in the warehouse.
Tap to reveal reality
Reality:Editing dbt-managed tables directly can cause inconsistencies because dbt overwrites them on the next run.
Why it matters:Direct edits can be lost or cause data errors, breaking the pipeline.
Expert Zone
1
dbt's dependency graph is built by parsing SQL references, so using dynamic SQL or macros can affect dependency detection and require explicit declarations.
2
Materializations like incremental models allow partial data updates, improving performance but requiring careful handling of unique keys and update logic.
3
dbt supports hooks and operations to run custom SQL before or after models, enabling advanced workflows like permissions or data validations.
When NOT to use
dbt models are not ideal for real-time streaming data transformations or complex procedural logic better handled by dedicated ETL tools or data processing frameworks like Apache Spark.
Production Patterns
In production, teams organize models into folders by domain, use tests to validate data quality, schedule dbt runs with orchestration tools, and use documentation features to maintain clear data lineage.
Connections
Software Engineering CI/CD
dbt applies software engineering practices like version control and automated runs to data transformations.
Understanding CI/CD helps grasp how dbt automates testing and deployment of data models, improving reliability.
Data Warehousing
dbt models transform raw data inside a data warehouse to create clean, analysis-ready tables.
Knowing data warehouse concepts clarifies why dbt models focus on SQL transformations and materializations.
Recipe Writing in Cooking
Both involve step-by-step instructions transforming raw ingredients into a finished product.
This cross-domain view highlights the importance of clear, repeatable instructions for consistent results.
Common Pitfalls
#1Running dbt without setting up the data warehouse connection.
Wrong approach:dbt run
Correct approach:Edit profiles.yml with correct connection details before running 'dbt run'.
Root cause:Not understanding that dbt needs connection info to execute SQL in the warehouse.
#2Writing a model SQL that selects from a non-existent table or model.
Wrong approach:SELECT * FROM raw.non_existing_table
Correct approach:SELECT * FROM raw.existing_table
Root cause:Not verifying source tables or models exist before referencing them.
#3Editing dbt-created tables directly in the warehouse after running models.
Wrong approach:UPDATE analytics.my_first_model SET column = 'value' WHERE id = 1;
Correct approach:Make changes by editing the model SQL and re-running dbt.
Root cause:Misunderstanding that dbt manages tables and overwrites them on runs.
Key Takeaways
A dbt model is a SQL file that transforms raw data into clean tables or views in your data warehouse.
Models are organized, reusable, and run automatically by dbt, saving time and reducing errors.
Understanding model dependencies lets you build complex pipelines that run in the correct order.
Configuring materializations affects how data is stored and refreshed, impacting performance.
dbt brings software engineering best practices to data transformation, making data pipelines reliable and maintainable.

Practice

(1/5)
1. What is the main purpose of a dbt model?
easy
A. To write Python scripts for data analysis
B. To store raw data without changes
C. To create visual dashboards
D. To transform raw data into clean, usable tables

Solution

  1. Step 1: Understand the role of dbt models

    dbt models are SQL files that transform raw data into clean tables for analysis.
  2. Step 2: Compare options with this role

    Only To transform raw data into clean, usable tables describes transforming raw data into clean tables, which matches the purpose of dbt models.
  3. Final Answer:

    To transform raw data into clean, usable tables -> Option D
  4. Quick Check:

    dbt model purpose = transform raw data [OK]
Hint: Remember: dbt models clean and transform data [OK]
Common Mistakes:
  • Confusing models with dashboards
  • Thinking models store raw data unchanged
  • Assuming models are Python scripts
2. Which of the following is the correct way to define a simple dbt model SQL file?
easy
A. SELECT * FROM raw_data
B. CREATE MODEL my_model AS SELECT * FROM raw_data
C. dbt run SELECT * FROM raw_data
D. INSERT INTO model SELECT * FROM raw_data

Solution

  1. Step 1: Recall dbt model syntax

    A dbt model is a SQL SELECT statement saved as a .sql file in the models folder.
  2. Step 2: Evaluate each option

    SELECT * FROM raw_data is a simple SELECT statement, which is the correct way to define a model. Options B, C, and D use incorrect syntax or commands not used in dbt model files.
  3. Final Answer:

    SELECT * FROM raw_data -> Option A
  4. Quick Check:

    dbt model = simple SELECT statement [OK]
Hint: dbt models are just SELECT queries saved as files [OK]
Common Mistakes:
  • Using CREATE MODEL syntax (not valid in dbt)
  • Trying to run dbt commands inside SQL files
  • Using INSERT statements instead of SELECT
3. Given the following dbt model SQL code saved as models/my_first_model.sql:
SELECT id, name FROM raw_customers WHERE active = true
What will be the output when you run dbt run?
medium
A. Nothing happens because dbt run does not create models
B. An error because of missing CREATE TABLE statement
C. A new table or view named my_first_model with active customers only
D. The raw_customers table will be deleted

Solution

  1. Step 1: Understand what dbt run does

    Running dbt run executes model SQL files and creates tables or views with the model name.
  2. Step 2: Analyze the model SQL

    The model selects id and name from raw_customers where active is true, so the output table will contain only active customers.
  3. Final Answer:

    A new table or view named my_first_model with active customers only -> Option C
  4. Quick Check:

    dbt run creates model tables = filtered active customers [OK]
Hint: dbt run creates tables from your SELECT queries [OK]
Common Mistakes:
  • Expecting CREATE TABLE in model SQL
  • Thinking dbt deletes source tables
  • Believing dbt run does nothing
4. You wrote this dbt model SQL file named models/customer_summary.sql:
SELECT customer_id, order_id, COUNT(*) AS orders_count
FROM orders
GROUP BY customer_id
When you run dbt run, you get an error. What is the most likely cause?
medium
A. Missing a semicolon at the end of the SQL statement
B. The GROUP BY column does not match the SELECT columns
C. The SELECT statement is missing a FROM clause
D. The model file is not saved in the models folder

Solution

  1. Step 1: Recall GROUP BY rules

    When using GROUP BY, all non-aggregated columns in SELECT must either be aggregated or included in GROUP BY.
  2. Step 2: Analyze the SELECT and GROUP BY columns

    SELECT has customer_id (in GROUP BY), order_id (neither aggregated nor grouped), COUNT(*) (aggregated). Thus, GROUP BY does not match SELECT columns.
  3. Final Answer:

    The GROUP BY column does not match the SELECT columns -> Option B
  4. Quick Check:

    GROUP BY must include all non-aggregated SELECT columns [OK]
Hint: Ensure all non-aggregated SELECT columns are in GROUP BY [OK]
Common Mistakes:
  • Forgetting to include non-aggregated columns in GROUP BY
  • Assuming semicolon is required
  • Saving model outside models folder
5. You want to create a dbt model that shows the total sales per product category, but only for categories with total sales above 1000. Which SQL code correctly implements this in your model file?
hard
A. SELECT category, SUM(sales) AS total_sales FROM sales_data GROUP BY category HAVING SUM(sales) > 1000
B. SELECT category, SUM(sales) AS total_sales FROM sales_data WHERE total_sales > 1000 GROUP BY category
C. SELECT category, SUM(sales) AS total_sales FROM sales_data GROUP BY category HAVING total_sales > 1000
D. SELECT category, SUM(sales) AS total_sales FROM sales_data WHERE SUM(sales) > 1000 GROUP BY category

Solution

  1. Step 1: Understand filtering on aggregated values

    To filter groups by aggregated values, use HAVING with the aggregate function, not WHERE.
  2. Step 2: Analyze each option

    SELECT category, SUM(sales) AS total_sales FROM sales_data GROUP BY category HAVING total_sales > 1000 uses HAVING total_sales > 1000, but total_sales is an alias and cannot be used directly in HAVING in many SQL dialects. SELECT category, SUM(sales) AS total_sales FROM sales_data GROUP BY category HAVING SUM(sales) > 1000 uses HAVING SUM(sales) > 1000, which is correct. Options B and D incorrectly use WHERE with aggregate functions, which is invalid.
  3. Final Answer:

    SELECT category, SUM(sales) AS total_sales FROM sales_data GROUP BY category HAVING SUM(sales) > 1000 -> Option A
  4. Quick Check:

    Use HAVING with aggregate functions to filter groups [OK]
Hint: Use HAVING with aggregate functions, not WHERE [OK]
Common Mistakes:
  • Using WHERE to filter aggregated results
  • Using alias names in HAVING clause
  • Forgetting GROUP BY when aggregating