
Data Fusion for ETL in GCP - Commands & Configuration

Introduction
Cloud Data Fusion is a fully managed data-integration service on Google Cloud that helps you move and transform data from one place to another. It solves the problem of combining data from many sources and preparing it for use without writing complex code.
When you want to combine sales data from different stores into one place for analysis
When you need to clean and organize customer data before loading it into a database
When you want to move data from cloud storage to a data warehouse automatically
When you want to schedule regular data updates without manual work
When you want to build data pipelines visually without coding
Config File - pipeline.json
{
  "name": "example-etl-pipeline",
  "description": "A simple ETL pipeline to move and transform data",
  "config": {
    "source": {
      "type": "GCS",
      "properties": {
        "path": "gs://example-bucket/input-data/"
      }
    },
    "transform": {
      "type": "Wrangler",
      "properties": {
        "script": "parse-as-csv :body ',' true; drop :unnecessary_column;"
      }
    },
    "sink": {
      "type": "BigQuery",
      "properties": {
        "dataset": "example_dataset",
        "table": "cleaned_data"
      }
    }
  }
}

This JSON defines a Data Fusion ETL pipeline named example-etl-pipeline. Note that the format here is simplified for readability; pipelines exported from Data Fusion itself use the more verbose CDAP pipeline spec.

The source section tells Data Fusion where to get the data, here from a Google Cloud Storage bucket.

The transform section uses a Wrangler script to parse CSV data and remove an unwanted column.
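As a rough local analogy (not how Wrangler actually executes), the parse-and-drop behavior of that script can be mimicked with standard shell tools:

```shell
# Sample CSV containing a column we don't need, mirroring the Wrangler script
printf 'id,unnecessary_column,name\n1,x,alice\n2,y,bob\n' > input.csv

# Keep fields 1 and 3, i.e. drop 'unnecessary_column' (field 2)
cut -d',' -f1,3 input.csv
```

This prints the two remaining columns (id,name / 1,alice / 2,bob), which is the cleaned shape the Wrangler step hands on to the sink.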

The sink section sends the cleaned data to a BigQuery table for analysis.
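For reference, pipelines exported from the Data Fusion Studio use the more verbose CDAP pipeline spec. A trimmed, hypothetical sketch of that shape (plugin names and artifact version are illustrative and vary by release):

```json
{
  "name": "example-etl-pipeline",
  "artifact": { "name": "cdap-data-pipeline", "version": "6.x.x", "scope": "SYSTEM" },
  "config": {
    "stages": [
      { "name": "GCS-Source", "plugin": { "name": "GCSFile", "type": "batchsource" } },
      { "name": "Wrangler", "plugin": { "name": "Wrangler", "type": "transform" } },
      { "name": "BigQuery-Sink", "plugin": { "name": "BigQueryTable", "type": "batchsink" } }
    ],
    "connections": [
      { "from": "GCS-Source", "to": "Wrangler" },
      { "from": "Wrangler", "to": "BigQuery-Sink" }
    ]
  }
}
```

The "stages" array holds each plugin's configuration, and "connections" wires the stages into the source-to-sink flow you see in the visual editor.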

Commands
This command creates a new Data Fusion instance named 'example-instance' in the us-central1 region. The 'basic' edition is suitable for simple ETL tasks. Note that instance creation can take up to 30 minutes.
Terminal
gcloud beta data-fusion instances create example-instance --location=us-central1 --edition=basic
Expected Output
Create request issued for: [projects/project-id/locations/us-central1/instances/example-instance] Waiting for operation [projects/project-id/locations/us-central1/operations/operation-id] to complete... Create operation finished successfully.
--location - Specifies the region where the Data Fusion instance will be created
--edition - Defines the edition of the instance (developer, basic, or enterprise)
Data Fusion pipelines are deployed through the instance's CDAP REST API rather than a dedicated gcloud command. This looks up the instance's API endpoint, then uploads the ETL pipeline defined in 'pipeline.json' into the 'default' namespace.
Terminal
ENDPOINT=$(gcloud beta data-fusion instances describe example-instance --location=us-central1 --format="value(apiEndpoint)")
curl -X PUT "${ENDPOINT}/v3/namespaces/default/apps/example-etl-pipeline" -H "Authorization: Bearer $(gcloud auth print-access-token)" -d @pipeline.json
Expected Output
Deploy Complete
-d @pipeline.json - Sends the pipeline configuration file as the request body
This command starts the pipeline's batch workflow (named DataPipelineWorkflow for batch pipelines) so it processes data as defined. ${ENDPOINT} is the instance's apiEndpoint value from 'gcloud beta data-fusion instances describe'.
Terminal
curl -X POST "${ENDPOINT}/v3/namespaces/default/apps/example-etl-pipeline/workflows/DataPipelineWorkflow/start" -H "Authorization: Bearer $(gcloud auth print-access-token)"
Expected Output
An empty HTTP 200 response; the new run then appears under the workflow's runs endpoint.
This command lists all deployed pipelines in the namespace so you can verify the pipeline exists. ${ENDPOINT} is the instance's apiEndpoint value from 'gcloud beta data-fusion instances describe'.
Terminal
curl "${ENDPOINT}/v3/namespaces/default/apps" -H "Authorization: Bearer $(gcloud auth print-access-token)"
Expected Output
A JSON array of application records that includes "name": "example-etl-pipeline"
Key Concept

If you remember nothing else from this pattern, remember: Data Fusion lets you build and run data pipelines visually to move and clean data without coding.

Common Mistakes
Not specifying the correct location when creating the Data Fusion instance or pipelines
The commands will fail or create resources in unexpected regions, causing confusion or errors.
Always use the --location flag with the correct region for your resources.
Uploading a pipeline JSON file with invalid or incomplete configuration
The pipeline creation will fail or the pipeline will not run as expected.
Validate the pipeline JSON file carefully and test it in the Data Fusion UI before running.
Running the pipeline before it is created or without an active Data Fusion instance
The run command will fail because the pipeline or instance does not exist.
Ensure the instance is created and the pipeline is uploaded successfully before running.
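The invalid-JSON mistake above can be caught locally before any upload. A minimal sketch, assuming python3 is available (the sample file content is hypothetical, standing in for your real pipeline.json):

```shell
# Stand-in for your real pipeline.json (hypothetical minimal content)
printf '{"name": "example-etl-pipeline"}\n' > pipeline.json

# json.tool exits non-zero on malformed JSON, catching syntax errors
# before the file is ever uploaded to Data Fusion
python3 -m json.tool pipeline.json > /dev/null && echo "pipeline.json is valid JSON"
```

This only checks that the file is well-formed JSON; the Data Fusion UI is still the place to verify the pipeline logic itself.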
Summary
Create a Data Fusion instance to host your ETL pipelines.
Define your ETL pipeline in a JSON file specifying source, transform, and sink.
Upload the pipeline to the Data Fusion instance.
Run the pipeline to process and move your data automatically.