
Data Fusion for ETL in GCP - Commands & Configuration

Introduction
Cloud Data Fusion is a fully managed data-integration service on Google Cloud that helps you move and transform data from one place to another. It solves the problem of combining data from many sources and preparing it for use without writing complex code.
When you want to combine sales data from different stores into one place for analysis
When you need to clean and organize customer data before loading it into a database
When you want to move data from cloud storage to a data warehouse automatically
When you want to schedule regular data updates without manual work
When you want to build data pipelines visually without coding
Config File - pipeline.json
{
  "name": "example-etl-pipeline",
  "description": "A simple ETL pipeline to move and transform data",
  "config": {
    "source": {
      "type": "GCS",
      "properties": {
        "path": "gs://example-bucket/input-data/"
      }
    },
    "transform": {
      "type": "Wrangler",
      "properties": {
        "script": "parse-as-csv :body ',' true; drop :unnecessary_column;"
      }
    },
    "sink": {
      "type": "BigQuery",
      "properties": {
        "dataset": "example_dataset",
        "table": "cleaned_data"
      }
    }
  }
}

This JSON defines a Data Fusion ETL pipeline named example-etl-pipeline. Note that the format here is simplified for readability; pipelines exported from Data Fusion itself use the more verbose CDAP pipeline spec.

The source section tells Data Fusion where to get the data, here from a Google Cloud Storage bucket.

The transform section uses a Wrangler script to parse CSV data and remove an unwanted column.
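As a rough local analogy (not how Wrangler actually executes), the parse-and-drop behavior of that script can be mimicked with standard shell tools:

```shell
# Sample CSV containing a column we don't need, mirroring the Wrangler script
printf 'id,unnecessary_column,name\n1,x,alice\n2,y,bob\n' > input.csv

# Keep fields 1 and 3, i.e. drop 'unnecessary_column' (field 2)
cut -d',' -f1,3 input.csv
```

This prints the two remaining columns (id,name / 1,alice / 2,bob), which is the cleaned shape the Wrangler step hands on to the sink.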

The sink section sends the cleaned data to a BigQuery table for analysis.
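For reference, pipelines exported from the Data Fusion Studio use the more verbose CDAP pipeline spec. A trimmed, hypothetical sketch of that shape (plugin names and artifact version are illustrative and vary by release):

```json
{
  "name": "example-etl-pipeline",
  "artifact": { "name": "cdap-data-pipeline", "version": "6.x.x", "scope": "SYSTEM" },
  "config": {
    "stages": [
      { "name": "GCS-Source", "plugin": { "name": "GCSFile", "type": "batchsource" } },
      { "name": "Wrangler", "plugin": { "name": "Wrangler", "type": "transform" } },
      { "name": "BigQuery-Sink", "plugin": { "name": "BigQueryTable", "type": "batchsink" } }
    ],
    "connections": [
      { "from": "GCS-Source", "to": "Wrangler" },
      { "from": "Wrangler", "to": "BigQuery-Sink" }
    ]
  }
}
```

The "stages" array holds each plugin's configuration, and "connections" wires the stages into the source-to-sink flow you see in the visual editor.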

Commands
This command creates a new Data Fusion instance named 'example-instance' in the us-central1 region. The 'basic' edition is suitable for simple ETL tasks. Note that instance creation can take up to 30 minutes.
Terminal
gcloud beta data-fusion instances create example-instance --location=us-central1 --edition=basic
Expected Output
Create request issued for: [projects/project-id/locations/us-central1/instances/example-instance] Waiting for operation [projects/project-id/locations/us-central1/operations/operation-id] to complete... Create operation finished successfully.
--location - Specifies the region where the Data Fusion instance will be created
--edition - Defines the edition of the instance (developer, basic, or enterprise)
Data Fusion pipelines are deployed through the instance's CDAP REST API rather than a dedicated gcloud command. This looks up the instance's API endpoint, then uploads the ETL pipeline defined in 'pipeline.json' into the 'default' namespace.
Terminal
ENDPOINT=$(gcloud beta data-fusion instances describe example-instance --location=us-central1 --format="value(apiEndpoint)")
curl -X PUT "${ENDPOINT}/v3/namespaces/default/apps/example-etl-pipeline" -H "Authorization: Bearer $(gcloud auth print-access-token)" -d @pipeline.json
Expected Output
Deploy Complete
-d @pipeline.json - Sends the pipeline configuration file as the request body
This command starts the pipeline's batch workflow (named DataPipelineWorkflow for batch pipelines) so it processes data as defined. ${ENDPOINT} is the instance's apiEndpoint value from 'gcloud beta data-fusion instances describe'.
Terminal
curl -X POST "${ENDPOINT}/v3/namespaces/default/apps/example-etl-pipeline/workflows/DataPipelineWorkflow/start" -H "Authorization: Bearer $(gcloud auth print-access-token)"
Expected Output
An empty HTTP 200 response; the new run then appears under the workflow's runs endpoint.
This command lists all deployed pipelines in the namespace so you can verify the pipeline exists. ${ENDPOINT} is the instance's apiEndpoint value from 'gcloud beta data-fusion instances describe'.
Terminal
curl "${ENDPOINT}/v3/namespaces/default/apps" -H "Authorization: Bearer $(gcloud auth print-access-token)"
Expected Output
A JSON array of application records that includes "name": "example-etl-pipeline"
Key Concept

If you remember nothing else from this pattern, remember: Data Fusion lets you build and run data pipelines visually to move and clean data without coding.

Common Mistakes
Not specifying the correct location when creating the Data Fusion instance or pipelines
The commands will fail or create resources in unexpected regions, causing confusion or errors.
Always use the --location flag with the correct region for your resources.
Uploading a pipeline JSON file with invalid or incomplete configuration
The pipeline creation will fail or the pipeline will not run as expected.
Validate the pipeline JSON file carefully and test it in the Data Fusion UI before running.
Running the pipeline before it is created or without an active Data Fusion instance
The run command will fail because the pipeline or instance does not exist.
Ensure the instance is created and the pipeline is uploaded successfully before running.
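The invalid-JSON mistake above can be caught locally before any upload. A minimal sketch, assuming python3 is available (the sample file content is hypothetical, standing in for your real pipeline.json):

```shell
# Stand-in for your real pipeline.json (hypothetical minimal content)
printf '{"name": "example-etl-pipeline"}\n' > pipeline.json

# json.tool exits non-zero on malformed JSON, catching syntax errors
# before the file is ever uploaded to Data Fusion
python3 -m json.tool pipeline.json > /dev/null && echo "pipeline.json is valid JSON"
```

This only checks that the file is well-formed JSON; the Data Fusion UI is still the place to verify the pipeline logic itself.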
Summary
Create a Data Fusion instance to host your ETL pipelines.
Define your ETL pipeline in a JSON file specifying source, transform, and sink.
Upload the pipeline to the Data Fusion instance.
Run the pipeline to process and move your data automatically.