
Dataflow for stream/batch processing in GCP - Step-by-Step Execution

Process Flow - Dataflow for stream/batch processing

Input Data Source → Dataflow Job Start → Choose Processing Mode (batch or stream) → Read Data (all at once, or continuously) → Apply Transformations → Write Output Data → Job Completes (batch) or Continues (stream)
Dataflow reads data from a source, processes it either as a batch (all at once) or as a stream (continuously), applies transformations, and writes the results.
Execution Sample
GCP
# Pseudocode -- the real service is driven through the Apache Beam SDK
pipeline = DataflowPipeline()
if mode == 'batch':
    data = pipeline.read_all(source)      # bounded: load everything up front
else:
    data = pipeline.read_stream(source)   # unbounded: read records as they arrive
result = data.apply_transformations()
pipeline.write(result, sink)
This code sets up a Dataflow pipeline that reads data in batch or stream mode, processes it, and writes the output.
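The pseudocode above can be simulated in plain Python to make the control flow concrete. Note this is a minimal sketch, not the Beam/Dataflow API: `read_all`, `read_stream`, and `apply_transformations` are hypothetical stand-ins, with iterables playing the role of sources and sinks.

```python
def read_all(source):
    """Batch: load every record before any processing starts."""
    return list(source)

def read_stream(source):
    """Stream: yield records one at a time, as they 'arrive'."""
    for record in source:
        yield record

def apply_transformations(data):
    """Illustrative transformation: uppercase each record."""
    return (record.upper() for record in data)

def run_pipeline(mode, source, sink):
    """Dispatch on mode, transform, then write to the sink."""
    if mode == 'batch':
        data = read_all(source)      # all data loaded at once
    else:
        data = read_stream(source)   # records consumed as they arrive
    result = apply_transformations(data)
    sink.extend(result)              # write output
    return sink

out = []
run_pipeline('batch', ['a', 'b', 'c'], out)
# out is now ['A', 'B', 'C']
```

Either mode funnels through the same transformation and write steps; only the reading strategy differs, which is exactly the branch the pseudocode shows.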
Process Table

| Step | Action | Mode | Data State | Output/Result |
|------|--------|------|------------|---------------|
| 1 | Start Dataflow job | batch or stream | No data read yet | Job initialized |
| 2 | Check mode | batch | No data read yet | Batch mode selected |
| 3 | Read all data from source | batch | All data loaded | Data ready for processing |
| 4 | Apply transformations | batch | Data loaded | Transformed data ready |
| 5 | Write output data | batch | Transformed data | Data written to sink |
| 6 | Job completes | batch | Output written | Job finished successfully |
| 1 | Start Dataflow job | stream | No data read yet | Job initialized |
| 2 | Check mode | stream | No data read yet | Stream mode selected |
| 3 | Read data continuously | stream | Data arriving continuously | Data stream open |
| 4 | Apply transformations continuously | stream | Data streaming | Transformed data streaming |
| 5 | Write output data continuously | stream | Transformed data streaming | Output stream updated |
| 6 | Job runs continuously | stream | Continuous processing | Job runs until stopped |
💡 Batch job ends after processing all data; stream job runs continuously until stopped.
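The difference in termination behavior can be sketched in plain Python (hypothetical helpers, not the Dataflow API): a batch job consumes a bounded input and returns on its own, while a stream job loops over an unbounded source and only ends when something external stops it.

```python
import itertools

def batch_job(records):
    """Bounded input: processes everything, then terminates by itself."""
    return [r * 2 for r in records]

def stream_job(record_source, stop_after):
    """Unbounded input: runs until stopped from outside.

    stop_after stands in for a manual cancel (like stopping the job);
    without it, this loop would run forever on an infinite source.
    """
    results = []
    for i, record in enumerate(record_source):
        if i >= stop_after:
            break
        results.append(record * 2)
    return results

batch_job([1, 2, 3])                  # finishes on its own: [2, 4, 6]
stream_job(itertools.count(1), 3)     # stopped externally:  [2, 4, 6]
```

`itertools.count(1)` is a genuinely infinite source, which is why the stream job needs an external stop condition while the batch job does not.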
Status Tracker

| Variable | Start | After Step 3 (Batch) | After Step 4 (Batch) | After Step 5 (Batch) | After Step 3 (Stream) | After Step 4 (Stream) | After Step 5 (Stream) |
|----------|-------|----------------------|----------------------|----------------------|-----------------------|-----------------------|-----------------------|
| mode | undefined | batch | batch | batch | stream | stream | stream |
| data | empty | all data loaded | transformed data | transformed data | stream open | transformed stream | transformed stream |
| job_state | initialized | running | running | completed | running | running | running |
Key Moments - 3 Insights
Why does the batch job stop after step 6 but the stream job keeps running?
Because batch mode processes all data at once and then finishes (see execution_table row 6), while stream mode processes data continuously and runs until manually stopped (row 12).
What changes in data reading between batch and stream modes?
Batch mode reads all data at once (row 3), while stream mode reads data continuously as it arrives (row 9). This affects how transformations and output writing happen.
How does the variable 'data' differ after reading in batch vs stream?
In batch, 'data' holds all loaded data at once (variable_tracker after step 3 batch), but in stream, 'data' represents an ongoing stream of data (variable_tracker after step 3 stream).
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution_table, at which step does the batch job write its output data?
A. Step 4
B. Step 6
C. Step 5
D. Step 3
💡 Hint
Check the 'Action' and 'Output/Result' columns for batch mode in execution_table rows 4-6.
According to variable_tracker, what is the state of 'job_state' after step 5 in batch mode?
A. completed
B. running
C. initialized
D. stopped
💡 Hint
Look at the 'job_state' row under 'After Step 5 (Batch)' in variable_tracker.
If the mode changes from 'batch' to 'stream', how does the data reading step change in the execution_table?
A. Reads all data at once
B. Reads data continuously
C. Does not read data
D. Reads data after writing output
💡 Hint
Compare execution_table rows 3 (batch) and 9 (stream) under 'Action' and 'Data State'.
Concept Snapshot
Dataflow processes data in two modes:
- Batch: reads all data, processes it, then finishes.
- Stream: reads data continuously and processes it continuously.
In both modes, the job chooses its mode at start, applies transformations to the data, and writes results to the output. Batch jobs end on their own; stream jobs run until stopped.
Full Transcript
Dataflow is a Google Cloud service that processes data either in batch or streaming mode. The job starts by reading data from a source. In batch mode, it reads all data at once, applies transformations, writes the output, and then finishes. In streaming mode, it reads data continuously as it arrives, applies transformations continuously, writes output continuously, and runs until manually stopped. Variables like 'data' and 'job_state' change differently depending on the mode. Batch jobs have a clear end, while streaming jobs run indefinitely. Understanding these steps helps in designing data processing pipelines effectively.