
Kappa architecture (streaming only) in Hadoop - Step-by-Step Execution

Concept Flow - Kappa architecture (streaming only)
Data Ingested from Source
Stream Processing Layer
Real-time Processing & Analytics
Serving Layer / Output
Feedback / Monitoring
Back to Stream Processing Layer
Data flows continuously from source through a single streaming layer for real-time processing and output, with feedback looping back for monitoring.
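The flow above can be sketched in plain Python, with no Spark or Kafka required. This is a conceptual model only: `source`, `streaming_layer`, and the `monitor` dictionary are illustrative names, not part of any real streaming API.

```python
def source():
    """Simulated event source (stands in for a Kafka topic)."""
    for raw in [b"click", b"view", b"click"]:
        yield raw

def streaming_layer(events, monitor):
    """Single processing layer: transform each event as it arrives."""
    for raw in events:
        record = raw.decode("utf-8")   # cf. CAST(value AS STRING)
        monitor["processed"] += 1      # feedback loop: monitoring counter
        yield record

monitor = {"processed": 0}
served = list(streaming_layer(source(), monitor))  # serving layer / output
print(served, monitor)  # ['click', 'view', 'click'] {'processed': 3}
```

The point of the sketch is that there is exactly one processing path: every event, old or new, goes through the same `streaming_layer`, and the monitoring counter closes the feedback loop.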
Execution Sample
Hadoop
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('KappaExample').getOrCreate()
# The Kafka source requires a broker address in addition to the topic subscription
streamingDF = (spark.readStream.format('kafka')
               .option('kafka.bootstrap.servers', 'localhost:9092')
               .option('subscribe', 'topic').load())
# Kafka delivers message values as raw bytes; cast them to strings
processedDF = streamingDF.selectExpr('CAST(value AS STRING)')
query = processedDF.writeStream.format('console').start()
query.awaitTermination()  # block until the stream is stopped
This code reads streaming data from Kafka, casts the message values to strings, and prints the results to the console in real time.
Execution Table
| Step | Action | Input Data | Processing | Output / State |
|------|--------|------------|------------|----------------|
| 1 | Start Spark session | None | Initialize Spark streaming context | Spark session ready |
| 2 | Read stream | Kafka topic messages | Connect to Kafka, read raw bytes | Streaming DataFrame with raw data |
| 3 | Select & cast | Raw bytes | Convert bytes to string | Streaming DataFrame with string values |
| 4 | Write stream | Processed data | Output to console sink | Streaming query started |
| 5 | Await termination | Streaming query active | Keep process running | Continuous output of streaming data |
| 6 | Stop | User interrupt | Stop streaming query | Streaming stopped |
💡 Streaming stops when the user interrupts or an error occurs; otherwise it runs continuously.
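Steps 5 and 6 can be illustrated with a small pure-Python simulation: the loop runs over an unbounded source until it is told to stop, which stands in for `awaitTermination` blocking until a user interrupt. The `run_stream` function and `stop_after` parameter are hypothetical, used only to make the stopping condition explicit.

```python
import itertools

def run_stream(events, stop_after=None):
    """Process an (possibly unbounded) event stream until interrupted."""
    processed = []
    for i, event in enumerate(events):
        if stop_after is not None and i >= stop_after:
            break  # stands in for query.stop() on user interrupt (step 6)
        processed.append(event.upper())
    return processed

# itertools.cycle never ends, like a live stream; only the stop condition ends it
out = run_stream(itertools.cycle(["a", "b"]), stop_after=4)
print(out)  # ['A', 'B', 'A', 'B']
```

Without the stop condition, the loop over `itertools.cycle` would never terminate, which mirrors why step 5 shows continuous output rather than a finished job.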
Variable Tracker
| Variable | Start | After Step 2 | After Step 3 | After Step 4 | Final |
|----------|-------|--------------|--------------|--------------|-------|
| spark | None | SparkSession object | SparkSession object | SparkSession object | SparkSession object |
| streamingDF | None | DataFrame with raw Kafka data | DataFrame with raw Kafka data | DataFrame with raw Kafka data | DataFrame with raw Kafka data |
| processedDF | None | None | DataFrame with string values | DataFrame with string values | DataFrame with string values |
| query | None | None | None | StreamingQuery object | StreamingQuery object (or stopped) |
Key Moments - 3 Insights
Why does the streaming query keep running instead of stopping after processing one batch?
Streaming queries in a Kappa architecture run continuously to process data in real time; step 5 of the execution table shows awaitTermination keeping the process alive.
What happens if new data arrives after the initial processing?
The streaming layer processes new data as it arrives and updates the output in real time, as shown by the feedback loop in the concept flow and the continuous output in step 5 of the execution table.
Why is there only one processing layer instead of separate batch and streaming layers?
Kappa architecture uses a single streaming layer for simplicity and real-time processing, unlike Lambda architecture, which maintains separate batch and streaming layers, as described in the concept flow.
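The last point can be made concrete with a hedged sketch: in Kappa there is no batch layer, so recomputation means replaying the retained event log through the same streaming code. The `log` list and `process` function below are illustrative stand-ins for a durable Kafka topic and the streaming job.

```python
# Durable event log (stands in for Kafka retention)
log = ["click", "view", "click", "view", "view"]

def process(stream):
    """One code path: count events as they stream through."""
    counts = {}
    for event in stream:
        counts[event] = counts.get(event, 0) + 1
    return counts

live = process(iter(log))      # normal streaming pass over arriving events
replayed = process(iter(log))  # "batch" recompute = replay the same log
print(live == replayed)        # True: one code path serves both needs
```

In a Lambda architecture, the batch recompute would be a second implementation that must be kept in sync with the streaming one; replaying through a single code path is what Kappa trades that duplication for.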
Visual Quiz - 3 Questions
Test your understanding
Looking at step 3 of the execution table, what transformation is applied to the input data?
A. Casting raw bytes to string
B. Filtering data by condition
C. Aggregating data by key
D. Writing data to storage
💡 Hint
Check the 'Processing' column in step 3 of the execution table.
At which step does the streaming query start writing output to the console?
A. Step 3
B. Step 4
C. Step 2
D. Step 5
💡 Hint
Look at the 'Action' and 'Output / State' columns of the execution table.
If the streaming query is stopped by the user, which step in the execution table corresponds to this?
A. Step 1
B. Step 4
C. Step 6
D. Step 5
💡 Hint
Refer to the 'Action' column describing stopping the streaming query.
Concept Snapshot
Kappa architecture uses a single streaming layer for real-time data processing.
Data flows continuously from source to output.
No separate batch layer is used.
Streaming queries run continuously until stopped.
Ideal for simple, real-time analytics pipelines.
Full Transcript
Kappa architecture focuses on processing data as a continuous stream. Data is ingested from sources like Kafka and processed in a single streaming layer. This layer transforms and outputs data in real time, without a separate batch layer. The streaming query runs continuously, processing new data as it arrives. The example code shows reading from Kafka, casting data to strings, and outputting to the console. The execution table traces each step from starting Spark to stopping the stream. Variables like streamingDF and query change state as the stream runs. Key moments clarify why streaming runs continuously and why only one processing layer is used. The visual quiz tests understanding of data transformation, output start, and stopping the stream. The snapshot summarizes the core idea: one streaming layer for real-time processing in Kappa architecture.