
Structured Streaming basics in Apache Spark - Step-by-Step Execution

Concept Flow - Structured Streaming basics
Start Streaming Query
Read Input Stream
Apply Transformations
Write Output Stream
Continuous Processing
Stop Streaming Query
Structured Streaming reads data continuously, applies transformations, writes output, and keeps running until stopped.
Execution Sample
Apache Spark
from pyspark.sql import SparkSession
# Create (or reuse) a SparkSession as the entry point
spark = SparkSession.builder.appName('stream').getOrCreate()
# Read a continuous stream of text lines from a local socket source
input_stream = spark.readStream.format('socket').option('host', 'localhost').option('port', 9999).load()
# Split each incoming line into one row per word
words = input_stream.selectExpr('explode(split(value, " ")) as word')
# Write each micro-batch of words to the console and start the query
query = words.writeStream.format('console').start()
# Block until the query is stopped or fails
query.awaitTermination()
This code reads text data from a socket, splits lines into words, and prints words to the console continuously.
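The split step can be illustrated without a Spark cluster: in plain Python, the effect of explode(split(value, " ")) is to flatten each input line into individual words, one output row per word (a sketch of the transformation's behavior, not Spark's actual execution).

```python
# Plain-Python sketch of what explode(split(value, " ")) does to a
# micro-batch: each input line becomes one output row per word.
lines = ["hello world", "spark streaming"]  # stand-in for socket input

words = [word for line in lines for word in line.split(" ")]
print(words)  # → ['hello', 'world', 'spark', 'streaming']
```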
Execution Table
Step | Action | Input Data | Transformation | Output Data | Status
1 | Start Streaming Query | No data yet | Initialize streaming context | No output | Running
2 | Read Input Stream | Line: 'hello world' | Read line from socket | DataFrame with 1 row: 'hello world' | Running
3 | Apply Transformation | DataFrame with 'hello world' | Split line into words | DataFrame with rows: 'hello', 'world' | Running
4 | Write Output Stream | DataFrame with words | Print to console | Console shows 'hello' and 'world' | Running
5 | Read Input Stream | Line: 'spark streaming' | Read line from socket | DataFrame with 1 row: 'spark streaming' | Running
6 | Apply Transformation | DataFrame with 'spark streaming' | Split line into words | DataFrame with rows: 'spark', 'streaming' | Running
7 | Write Output Stream | DataFrame with words | Print to console | Console shows 'spark' and 'streaming' | Running
8 | Stop Streaming Query | N/A | Stop query | No output | Stopped
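The steps above can be sketched as a loop in plain Python: read a batch of input, transform it, write the result, and repeat until told to stop. This is only an analogy for Spark's micro-batch engine, not its real scheduler.

```python
# Plain-Python analogy of the Execution Table: each loop iteration is one
# read-transform-write cycle; the status flips to Stopped at the end.
incoming = ["hello world", "spark streaming"]  # stand-in for the socket source
console = []                                   # stand-in for the console sink
status = "Running"

for line in incoming:            # Read Input Stream
    batch = line.split(" ")      # Apply Transformation: split line into words
    console.extend(batch)        # Write Output Stream

status = "Stopped"               # Stop Streaming Query
print(console, status)  # → ['hello', 'world', 'spark', 'streaming'] Stopped
```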
💡 Streaming query stopped by user or program
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 4 | After Step 5 | After Step 6 | After Step 7 | Final
input_stream | empty | 'hello world' | 'hello world' | 'hello world' | 'spark streaming' | 'spark streaming' | 'spark streaming' | final state
words | empty | empty | ['hello', 'world'] | ['hello', 'world'] | ['hello', 'world'] | ['spark', 'streaming'] | ['spark', 'streaming'] | final state
query.status | not started | running | running | running | running | running | running | stopped
Key Moments - 3 Insights
Why does the streaming query keep running after processing some data?
Because Structured Streaming is designed to run continuously, the query waits for new data until it is explicitly stopped, as shown in Steps 1 to 7 of the Execution Table.
What happens if no new data arrives in the input stream?
The streaming query stays active but does not output new data until new input arrives, as the status remains 'Running' without new output rows.
Why do we use 'awaitTermination()' at the end?
'awaitTermination()' blocks the driver program so that continuous processing can proceed, preventing the program from exiting immediately after the stream starts.
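The blocking behavior of awaitTermination() can be mimicked with a background thread in plain Python: without a join(), the main program could exit while the worker is still producing results. This is an analogy only; Spark's implementation differs.

```python
import threading
import queue

results = queue.Queue()

def worker():
    # Stand-in for the continuously running streaming query
    for line in ["hello world"]:
        for word in line.split(" "):
            results.put(word)

t = threading.Thread(target=worker)
t.start()
t.join()  # like query.awaitTermination(): block until the work finishes

print(list(results.queue))  # → ['hello', 'world']
```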
Visual Quiz - 3 Questions
Test your understanding
Looking at the Execution Table at Step 3, which transformation is applied to the input data?
A. Writing data to console
B. Reading data from socket
C. Splitting the line into words
D. Stopping the query
💡 Hint
Check the 'Transformation' column at Step 3 in the Execution Table.
At which step does the streaming query stop running?
A. Step 4
B. Step 8
C. Step 7
D. Step 2
💡 Hint
Look at the 'Status' column in the Execution Table for the step where the status changes to 'Stopped'.
If the input stream receives no new data after Step 7, what will be the query status?
A. Running but no new output
B. Stopped
C. Error
D. Idle and closed
💡 Hint
Refer to the key moment about query behavior when no new data arrives.
Concept Snapshot
Structured Streaming reads data continuously from a source.
It applies transformations like splitting or filtering.
Results are written to a sink like console or file.
The query runs continuously until stopped.
Use 'awaitTermination()' to keep the program running.
Ideal for real-time data processing.
Full Transcript
Structured Streaming in Apache Spark lets you process data continuously as it arrives. You start a streaming query that reads data from a source, applies transformations such as splitting lines into words, and writes the results to an output sink such as the console. The query keeps running, waiting for new data, until you stop it explicitly. Calling 'awaitTermination()' blocks the driver program so that streaming can continue. This approach suits real-time processing where data flows in continuously.