GCP · ~30 mins

Dataflow for stream/batch processing in GCP - Mini Project: Build & Apply

📖 Scenario: You work for a company that collects website visit data. You want to process this data using Google Cloud Dataflow. Sometimes data comes in batches, and sometimes it streams in real-time. You will create a simple Dataflow pipeline configuration to handle both batch and streaming data.
🎯 Goal: Build a Google Cloud Dataflow pipeline configuration that can run in both batch and streaming modes by setting up the input source, pipeline options, and the runner configuration.
📋 What You'll Learn
Create a pipeline options dictionary with the project ID and region
Add a boolean configuration to specify streaming mode
Configure the input source as a Pub/Sub subscription for streaming or a Cloud Storage path for batch
Complete the pipeline options with the runner and input source
💡 Why This Matters
🌍 Real World
Dataflow pipelines are used to process large amounts of data in real-time or batch mode for analytics, monitoring, and reporting.
💼 Career
Understanding how to configure Dataflow pipelines is essential for cloud engineers and data engineers working with Google Cloud Platform.
1
Create initial pipeline options dictionary
Create a dictionary called pipeline_options with these exact entries: 'project': 'my-gcp-project' and 'region': 'us-central1'.
Hint: Use a Python dictionary with the keys 'project' and 'region' and their exact values.
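In plain Python, this first step might look like the sketch below, using the exact project ID and region given above:

```python
# Step 1: pipeline options with the required project ID and region
pipeline_options = {
    'project': 'my-gcp-project',
    'region': 'us-central1',
}
```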

2
Add streaming mode configuration
Add a boolean key 'streaming' with the value True to the existing pipeline_options dictionary.
Hint: Add the key 'streaming' with the value True to the dictionary using square brackets.
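A minimal sketch of this step, continuing from the dictionary built in step 1:

```python
# Dictionary from step 1
pipeline_options = {
    'project': 'my-gcp-project',
    'region': 'us-central1',
}

# Step 2: enable streaming mode with a boolean flag
pipeline_options['streaming'] = True
```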

3
Configure input source based on streaming mode
Create a variable called input_source. Set it to the string 'projects/my-gcp-project/subscriptions/my-subscription' if pipeline_options['streaming'] is True. Otherwise, set it to the string 'gs://my-bucket/data/*.json'.
Hint: Use a conditional expression to assign input_source based on the streaming flag.
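One way to write this with a Python conditional expression, assuming the dictionary from the previous steps:

```python
# Options built in steps 1-2
pipeline_options = {
    'project': 'my-gcp-project',
    'region': 'us-central1',
    'streaming': True,
}

# Step 3: a Pub/Sub subscription when streaming, a Cloud Storage glob otherwise
input_source = (
    'projects/my-gcp-project/subscriptions/my-subscription'
    if pipeline_options['streaming']
    else 'gs://my-bucket/data/*.json'
)
```

If 'streaming' were False, input_source would instead be the Cloud Storage path 'gs://my-bucket/data/*.json'.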

4
Complete pipeline options with runner and input source
Add the key 'runner' with value 'DataflowRunner' and the key 'input' with the value of input_source to the pipeline_options dictionary.
Hint: Add the keys 'runner' and 'input' to the dictionary with the specified values.
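Putting all four steps together, the finished configuration could look like this:

```python
# Steps 1-2: project, region, and streaming flag
pipeline_options = {
    'project': 'my-gcp-project',
    'region': 'us-central1',
    'streaming': True,
}

# Step 3: input source chosen by the streaming flag
input_source = (
    'projects/my-gcp-project/subscriptions/my-subscription'
    if pipeline_options['streaming']
    else 'gs://my-bucket/data/*.json'
)

# Step 4: complete the options with the runner and input source
pipeline_options['runner'] = 'DataflowRunner'
pipeline_options['input'] = input_source
```

Note that this dictionary is only the configuration sketch these steps build; wiring it into a running Apache Beam pipeline (for example via Beam's PipelineOptions) is beyond the scope of this mini project.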