GCP · Cloud · ~30 mins

Data pipeline patterns in GCP - Mini Project: Build & Apply

Data Pipeline Patterns on Google Cloud Platform
📖 Scenario: You work for a company that collects sales data from multiple stores. You want to build a simple data pipeline on Google Cloud Platform (GCP) to collect, process, and store this data efficiently. This pipeline will help the company analyze sales trends and make better decisions.
🎯 Goal: Build a basic data pipeline on GCP using Cloud Storage, Pub/Sub, Dataflow, and BigQuery. You will create the initial data source, configure a Pub/Sub topic, write a Dataflow pipeline to process messages, and set up a BigQuery table to store the results.
📋 What You'll Learn
Create a Cloud Storage bucket to hold raw sales data files
Create a Pub/Sub topic to receive messages about new data
Write a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery
Create a BigQuery table to store processed sales data
💡 Why This Matters
🌍 Real World
Companies often collect data from many sources and need to process it in real-time or batch to gain insights. This project shows a simple way to build such a pipeline on GCP.
💼 Career
Understanding how to build data pipelines on cloud platforms like GCP is essential for roles in data engineering, cloud architecture, and analytics.
1
Create a Cloud Storage bucket for raw sales data
Create a Cloud Storage bucket named sales-raw-data in the us-central1 region to hold raw sales data files.
GCP
Need a hint?

Use the gcloud storage buckets create command with the --location flag.
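The hint above boils down to a single command. One caveat: bucket names are globally unique, so sales-raw-data may already be taken by another project; in practice you would prepend a unique prefix such as your project ID.

```shell
# Create a regional bucket in us-central1 to hold raw sales data files
gcloud storage buckets create gs://sales-raw-data --location=us-central1
```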

2
Create a Pub/Sub topic for new data notifications
Create a Pub/Sub topic named sales-data-topic to receive messages when new sales data files arrive.
GCP
Need a hint?

Use the gcloud pubsub topics create command with the topic name.
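Following the hint, the topic is created with one command (run in the project where the pipeline will live):

```shell
# Create the topic that will receive notifications about new sales data
gcloud pubsub topics create sales-data-topic
```

You can verify it exists afterwards with `gcloud pubsub topics list`.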

3
Write a Dataflow pipeline to process Pub/Sub messages
Write a Dataflow pipeline in Python that reads messages from the Pub/Sub topic sales-data-topic and writes the processed data to a BigQuery table named sales_dataset.sales_table. Use the Apache Beam SDK and specify the project as my-gcp-project.
GCP
Need a hint?

Use Apache Beam's ReadFromPubSub and WriteToBigQuery transforms with the correct topic and table names.
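A minimal sketch of such a pipeline, assuming each Pub/Sub message is a JSON object like {"store_id": "...", "date": "...", "sales": ...} (the message format is an assumption, as are the region and the temp_location path under the sales-raw-data bucket). The Beam SDK is imported inside run() so the parsing helper can be reused or tested without the SDK installed.

```python
import json


def parse_sale(message: bytes) -> dict:
    """Decode one Pub/Sub message into a row dict matching the BigQuery schema."""
    record = json.loads(message.decode("utf-8"))
    return {
        "store_id": str(record["store_id"]),
        "date": str(record["date"]),
        "sales": float(record["sales"]),
    }


def run():
    # Imported here so parse_sale stays usable without apache-beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        streaming=True,                      # Pub/Sub sources are unbounded
        project="my-gcp-project",
        runner="DataflowRunner",
        region="us-central1",                # assumed region
        temp_location="gs://sales-raw-data/tmp",  # assumed staging path
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-gcp-project/topics/sales-data-topic")
            | "ParseJSON" >> beam.Map(parse_sale)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-gcp-project:sales_dataset.sales_table",
                schema="store_id:STRING,date:STRING,sales:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```

Launching this script submits the job to the Dataflow service; it will run until cancelled, continuously moving messages from the topic into the table.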

4
Create a BigQuery table to store processed sales data
Create a BigQuery dataset named sales_dataset and a table named sales_table with columns store_id (STRING), date (STRING), and sales (FLOAT) in the project my-gcp-project.
GCP
Need a hint?

Use the bq mk command to create the dataset and then the table with the correct schema.
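Using the bq command-line tool (installed with the Google Cloud CLI), this step is two commands: one for the dataset, one for the table with its inline schema.

```shell
# Create the dataset in the target project
bq mk --dataset my-gcp-project:sales_dataset

# Create the table with the schema expected by the Dataflow pipeline
bq mk --table my-gcp-project:sales_dataset.sales_table \
    store_id:STRING,date:STRING,sales:FLOAT
```

The column types here must match what the pipeline writes, or streaming inserts will be rejected.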