dbtdata~30 mins

Clustering and partitioning in dbt - Mini Project: Build & Apply

Choose your learning style9 modes available

Learn Why Deep Visual Try Challenge Project Recall Time

Clustering and Partitioning with dbt

📖 Scenario: You work as a data analyst for an online store. The store's sales data is large and growing fast. To make queries faster and easier, you want to organize the data by clustering and partitioning it in the database using dbt.

🎯 Goal: You will create a dbt model that partitions the sales data by order_date and clusters it by customer_id. This will help speed up queries that filter by date and customer.

📋 What You'll Learn

Create a dbt model SQL file named sales_partitioned.sql.

Partition the table by order_date using RANGE partitioning.

Cluster the table by customer_id.

Use a simple SELECT statement from the raw sales table.

Print the final SQL code to verify the partitioning and clustering syntax.

💡 Why This Matters

🌍 Real World

Large datasets in data warehouses can be slow to query. Partitioning and clustering organize data to speed up queries and reduce costs.

💼 Career

Data analysts and engineers use dbt to build efficient data models that improve performance and maintainability in analytics workflows.

Progress0 / 4 steps

Create the base sales model

Create a dbt model SQL file named sales_partitioned.sql. Write a SELECT statement that selects all columns from the table raw_sales. Assign this SQL to a variable called sales_sql.

dbt

# Your code here

Need a hint?

Use triple quotes for multi-line SQL string. Write a simple SELECT * query from raw_sales.

Add partitioning configuration

Create a dictionary called partition_config with keys partition_by and data_type. Set partition_by to 'order_date' and data_type to 'RANGE'.

dbt

sales_sql = """
SELECT *
FROM raw_sales
"""
# Your code here

Need a hint?

Create a dictionary with the exact keys and values for partitioning.

Add clustering configuration

Create a list called cluster_by with a single string element 'customer_id'.

dbt

sales_sql = """
SELECT *
FROM raw_sales
"""

partition_config = {
    'partition_by': 'order_date',
    'data_type': 'RANGE'
}
# Your code here

Need a hint?

Create a list with the clustering column name.

Print the final dbt model SQL with partitioning and clustering

Print the final SQL string that includes the partitioning and clustering configuration. Use an f-string to combine sales_sql, partition_config['partition_by'], and cluster_by[0] in the format:
CREATE TABLE sales_partitioned PARTITION BY RANGE(order_date) CLUSTER BY customer_id AS followed by the sales_sql query.

dbt

sales_sql = """
SELECT *
FROM raw_sales
"""

partition_config = {
    'partition_by': 'order_date',
    'data_type': 'RANGE'
}

cluster_by = ['customer_id']

# Your code here

Need a hint?

Use an f-string to build the final SQL string and print it.