
Dataflow for stream/batch processing in GCP - Step-by-Step Execution

Process Flow - Dataflow for stream/batch processing

Input Data Source → Dataflow Job Start → Choose Processing Mode (batch or stream) → Read Data (all at once, or continuously) → Apply Transformations → Write Output Data → Job Completes (batch) or Continues (stream)
Dataflow reads data from a source, processes it either as a batch (all at once) or as a stream (continuously), applies transformations, and writes the results.
Execution Sample
GCP
# Pseudocode -- the real service is driven through the Apache Beam SDK
pipeline = DataflowPipeline()
if mode == 'batch':
    data = pipeline.read_all(source)      # bounded: load everything up front
else:
    data = pipeline.read_stream(source)   # unbounded: read records as they arrive
result = data.apply_transformations()
pipeline.write(result, sink)
This code sets up a Dataflow pipeline that reads data in batch or stream mode, processes it, and writes the output.
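The pseudocode above can be simulated in plain Python to make the control flow concrete. Note this is a minimal sketch, not the Beam/Dataflow API: `read_all`, `read_stream`, and `apply_transformations` are hypothetical stand-ins, with iterables playing the role of sources and sinks.

```python
def read_all(source):
    """Batch: load every record before any processing starts."""
    return list(source)

def read_stream(source):
    """Stream: yield records one at a time, as they 'arrive'."""
    for record in source:
        yield record

def apply_transformations(data):
    """Illustrative transformation: uppercase each record."""
    return (record.upper() for record in data)

def run_pipeline(mode, source, sink):
    """Dispatch on mode, transform, then write to the sink."""
    if mode == 'batch':
        data = read_all(source)      # all data loaded at once
    else:
        data = read_stream(source)   # records consumed as they arrive
    result = apply_transformations(data)
    sink.extend(result)              # write output
    return sink

out = []
run_pipeline('batch', ['a', 'b', 'c'], out)
# out is now ['A', 'B', 'C']
```

Either mode funnels through the same transformation and write steps; only the reading strategy differs, which is exactly the branch the pseudocode shows.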
Process Table

| Step | Action | Mode | Data State | Output/Result |
|------|--------|------|------------|---------------|
| 1 | Start Dataflow job | batch or stream | No data read yet | Job initialized |
| 2 | Check mode | batch | No data read yet | Batch mode selected |
| 3 | Read all data from source | batch | All data loaded | Data ready for processing |
| 4 | Apply transformations | batch | Data loaded | Transformed data ready |
| 5 | Write output data | batch | Transformed data | Data written to sink |
| 6 | Job completes | batch | Output written | Job finished successfully |
| 1 | Start Dataflow job | stream | No data read yet | Job initialized |
| 2 | Check mode | stream | No data read yet | Stream mode selected |
| 3 | Read data continuously | stream | Data arriving continuously | Data stream open |
| 4 | Apply transformations continuously | stream | Data streaming | Transformed data streaming |
| 5 | Write output data continuously | stream | Transformed data streaming | Output stream updated |
| 6 | Job runs continuously | stream | Continuous processing | Job runs until stopped |
💡 Batch job ends after processing all data; stream job runs continuously until stopped.
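The difference in termination behavior can be sketched in plain Python (hypothetical helpers, not the Dataflow API): a batch job consumes a bounded input and returns on its own, while a stream job loops over an unbounded source and only ends when something external stops it.

```python
import itertools

def batch_job(records):
    """Bounded input: processes everything, then terminates by itself."""
    return [r * 2 for r in records]

def stream_job(record_source, stop_after):
    """Unbounded input: runs until stopped from outside.

    stop_after stands in for a manual cancel (like stopping the job);
    without it, this loop would run forever on an infinite source.
    """
    results = []
    for i, record in enumerate(record_source):
        if i >= stop_after:
            break
        results.append(record * 2)
    return results

batch_job([1, 2, 3])                  # finishes on its own: [2, 4, 6]
stream_job(itertools.count(1), 3)     # stopped externally:  [2, 4, 6]
```

`itertools.count(1)` is a genuinely infinite source, which is why the stream job needs an external stop condition while the batch job does not.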
Status Tracker

| Variable | Start | After Step 3 (Batch) | After Step 4 (Batch) | After Step 5 (Batch) | After Step 3 (Stream) | After Step 4 (Stream) | After Step 5 (Stream) |
|----------|-------|----------------------|----------------------|----------------------|-----------------------|-----------------------|-----------------------|
| mode | undefined | batch | batch | batch | stream | stream | stream |
| data | empty | all data loaded | transformed data | transformed data | stream open | transformed stream | transformed stream |
| job_state | initialized | running | running | completed | running | running | running |
Key Moments - 3 Insights
Why does the batch job stop after step 6 but the stream job keeps running?
Because batch mode processes all data at once and then finishes (see execution_table row 6), while stream mode processes data continuously and runs until manually stopped (row 12).
What changes in data reading between batch and stream modes?
Batch mode reads all data at once (row 3), while stream mode reads data continuously as it arrives (row 9). This affects how transformations and output writing happen.
How does the variable 'data' differ after reading in batch vs stream?
In batch, 'data' holds all loaded data at once (variable_tracker after step 3 batch), but in stream, 'data' represents an ongoing stream of data (variable_tracker after step 3 stream).
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution_table, at which step does the batch job write its output data?
A. Step 4
B. Step 6
C. Step 5
D. Step 3
💡 Hint
Check the 'Action' and 'Output/Result' columns for batch mode in execution_table rows 4-6.
According to variable_tracker, what is the state of 'job_state' after step 5 in batch mode?
A. completed
B. running
C. initialized
D. stopped
💡 Hint
Look at the 'job_state' row under 'After Step 5 (Batch)' in variable_tracker.
If the mode changes from 'batch' to 'stream', how does the data reading step change in the execution_table?
A. Reads all data at once
B. Reads data continuously
C. Does not read data
D. Reads data after writing output
💡 Hint
Compare execution_table rows 3 (batch) and 9 (stream) under 'Action' and 'Data State'.
Concept Snapshot
Dataflow processes data in two modes:
- Batch: reads all data, processes it, then finishes.
- Stream: reads data continuously and processes it continuously.
In both modes, the job chooses its mode at start, applies transformations to the data, and writes results to the output. Batch jobs end on their own; stream jobs run until stopped.
Full Transcript
Dataflow is a Google Cloud service that processes data either in batch or streaming mode. The job starts by reading data from a source. In batch mode, it reads all data at once, applies transformations, writes the output, and then finishes. In streaming mode, it reads data continuously as it arrives, applies transformations continuously, writes output continuously, and runs until manually stopped. Variables like 'data' and 'job_state' change differently depending on the mode. Batch jobs have a clear end, while streaming jobs run indefinitely. Understanding these steps helps in designing data processing pipelines effectively.