
Data Fusion for ETL in GCP - Mini Project: Build & Apply

Data Fusion for ETL
📖 Scenario: You work as a data engineer at a retail company. Your team wants to automate extracting sales data from a Cloud Storage bucket, transforming it by filtering for sales above a certain amount, and loading the filtered data into BigQuery for analysis. You will use Google Cloud Data Fusion to build this ETL pipeline step by step.
🎯 Goal: Build a simple ETL pipeline in Google Cloud Data Fusion that reads sales data from Cloud Storage, filters sales above a threshold, and writes the results to a BigQuery table.
📋 What You'll Learn
Create a Cloud Storage source plugin configuration with the exact bucket name and file path
Add a configuration variable for the sales amount threshold
Use a Wrangler or Transform plugin to filter sales records above the threshold
Configure a BigQuery sink plugin with the exact dataset and table name
💡 Why This Matters
🌍 Real World
ETL pipelines are essential for moving and transforming data in cloud environments to prepare it for analysis and reporting.
💼 Career
Data engineers and cloud architects often build and configure ETL pipelines using tools like Google Cloud Data Fusion to automate data workflows.
1
Create Cloud Storage Source Configuration
Create a Cloud Storage source plugin configuration dictionary called cloud_storage_source with these exact keys and values: "name": "CloudStorageSource", "type": "cloudstorage", "properties" containing "referenceName": "sales_data_source", "path": "gs://retail-data-bucket/sales/2024/sales_data.csv", and "format": "csv".
GCP
Need a hint?

Use a Python dictionary with nested dictionaries for properties.
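One way to lay out this step, using exactly the keys and values listed above:

```python
# Cloud Storage source plugin configuration for the sales data file.
# All keys and values come from the Step 1 instructions.
cloud_storage_source = {
    "name": "CloudStorageSource",
    "type": "cloudstorage",
    "properties": {
        "referenceName": "sales_data_source",
        "path": "gs://retail-data-bucket/sales/2024/sales_data.csv",
        "format": "csv",
    },
}
```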

2
Add Sales Threshold Configuration
Create a variable called sales_threshold and set it to the integer 1000. This will be used to filter for sales records at or above this amount.
Need a hint?

Just assign the number 1000 to the variable named sales_threshold.
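This step is a single assignment, as described above:

```python
# Minimum sales amount; records at or above this value pass the filter.
sales_threshold = 1000
```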

3
Create Filter Transform Configuration
Create a transform plugin configuration dictionary called filter_transform with these exact keys and values: "name": "FilterTransform", "type": "transform", "properties" containing "condition": "${sales_amount} >= {sales_threshold}". Use the exact string with sales_amount and sales_threshold as shown.
Need a hint?

Use a dictionary with nested properties. The condition string must match exactly.
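A sketch of the transform configuration; note that the condition value is a plain string literal (not a Python f-string), so the placeholder braces are kept exactly as written in the step:

```python
# Filter transform plugin configuration.
# The "condition" string must match the Step 3 instructions character
# for character, including both placeholder styles.
filter_transform = {
    "name": "FilterTransform",
    "type": "transform",
    "properties": {
        "condition": "${sales_amount} >= {sales_threshold}",
    },
}
```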

4
Configure BigQuery Sink Plugin
Create a BigQuery sink plugin configuration dictionary called bigquery_sink with these exact keys and values: "name": "BigQuerySink", "type": "bigquery", "properties" containing "referenceName": "sales_data_sink", "dataset": "retail_analytics", and "table": "filtered_sales".
Need a hint?

Use a dictionary with nested properties for BigQuery sink configuration.
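Mirroring the source configuration from Step 1, the sink dictionary uses the dataset and table names listed above:

```python
# BigQuery sink plugin configuration for the filtered sales output.
# All keys and values come from the Step 4 instructions.
bigquery_sink = {
    "name": "BigQuerySink",
    "type": "bigquery",
    "properties": {
        "referenceName": "sales_data_sink",
        "dataset": "retail_analytics",
        "table": "filtered_sales",
    },
}
```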