
Data Fusion for ETL in GCP - Mini Project: Build & Apply

Data Fusion for ETL
📖 Scenario: You work as a data engineer at a retail company. Your team wants to automate extracting sales data from a Cloud Storage bucket, transforming it by filtering for sales above a certain amount, and loading the filtered data into BigQuery for analysis. You will use Google Cloud Data Fusion to build this ETL pipeline step by step.
🎯 Goal: Build a simple ETL pipeline in Google Cloud Data Fusion that reads sales data from Cloud Storage, filters sales above a threshold, and writes the results to a BigQuery table.
📋 What You'll Learn
Create a Cloud Storage source plugin configuration with the exact bucket name and file path
Add a configuration variable for the sales amount threshold
Use a Wrangler or Transform plugin to filter sales records above the threshold
Configure a BigQuery sink plugin with the exact dataset and table name
💡 Why This Matters
🌍 Real World
ETL pipelines are essential for moving and transforming data in cloud environments to prepare it for analysis and reporting.
💼 Career
Data engineers and cloud architects often build and configure ETL pipelines using tools like Google Cloud Data Fusion to automate data workflows.
1
Create Cloud Storage Source Configuration
Create a Cloud Storage source plugin configuration dictionary called cloud_storage_source with these exact keys and values: "name": "CloudStorageSource", "type": "cloudstorage", "properties" containing "referenceName": "sales_data_source", "path": "gs://retail-data-bucket/sales/2024/sales_data.csv", and "format": "csv".
GCP
Need a hint?

Use a Python dictionary with nested dictionaries for properties.
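One way to lay out this step, using exactly the keys and values listed above:

```python
# Cloud Storage source plugin configuration for the sales data file.
# All keys and values come from the Step 1 instructions.
cloud_storage_source = {
    "name": "CloudStorageSource",
    "type": "cloudstorage",
    "properties": {
        "referenceName": "sales_data_source",
        "path": "gs://retail-data-bucket/sales/2024/sales_data.csv",
        "format": "csv",
    },
}
```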

2
Add Sales Threshold Configuration
Create a variable called sales_threshold and set it to the integer 1000. This will be used to filter for sales records at or above this amount.
Need a hint?

Just assign the number 1000 to the variable named sales_threshold.
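This step is a single assignment, as described above:

```python
# Minimum sales amount; records at or above this value pass the filter.
sales_threshold = 1000
```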

3
Create Filter Transform Configuration
Create a transform plugin configuration dictionary called filter_transform with these exact keys and values: "name": "FilterTransform", "type": "transform", "properties" containing "condition": "${sales_amount} >= {sales_threshold}". Use the exact string with sales_amount and sales_threshold as shown.
Need a hint?

Use a dictionary with nested properties. The condition string must match exactly.
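A sketch of the transform configuration; note that the condition value is a plain string literal (not a Python f-string), so the placeholder braces are kept exactly as written in the step:

```python
# Filter transform plugin configuration.
# The "condition" string must match the Step 3 instructions character
# for character, including both placeholder styles.
filter_transform = {
    "name": "FilterTransform",
    "type": "transform",
    "properties": {
        "condition": "${sales_amount} >= {sales_threshold}",
    },
}
```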

4
Configure BigQuery Sink Plugin
Create a BigQuery sink plugin configuration dictionary called bigquery_sink with these exact keys and values: "name": "BigQuerySink", "type": "bigquery", "properties" containing "referenceName": "sales_data_sink", "dataset": "retail_analytics", and "table": "filtered_sales".
Need a hint?

Use a dictionary with nested properties for BigQuery sink configuration.
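Mirroring the source configuration from Step 1, the sink dictionary uses the dataset and table names listed above:

```python
# BigQuery sink plugin configuration for the filtered sales output.
# All keys and values come from the Step 4 instructions.
bigquery_sink = {
    "name": "BigQuerySink",
    "type": "bigquery",
    "properties": {
        "referenceName": "sales_data_sink",
        "dataset": "retail_analytics",
        "table": "filtered_sales",
    },
}
```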