
Kappa architecture (streaming only) in Hadoop - Step-by-Step Execution

Concept Flow - Kappa architecture (streaming only)
Data Ingested from Source
Stream Processing Layer
Real-time Processing & Analytics
Serving Layer / Output
Feedback / Monitoring
Back to Stream Processing Layer
Data flows continuously from source through a single streaming layer for real-time processing and output, with feedback looping back for monitoring.
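The flow above can be sketched in plain Python, with no Spark or Kafka required. This is a conceptual model only: `source`, `streaming_layer`, and the `monitor` dictionary are illustrative names, not part of any real streaming API.

```python
def source():
    """Simulated event source (stands in for a Kafka topic)."""
    for raw in [b"click", b"view", b"click"]:
        yield raw

def streaming_layer(events, monitor):
    """Single processing layer: transform each event as it arrives."""
    for raw in events:
        record = raw.decode("utf-8")   # cf. CAST(value AS STRING)
        monitor["processed"] += 1      # feedback loop: monitoring counter
        yield record

monitor = {"processed": 0}
served = list(streaming_layer(source(), monitor))  # serving layer / output
print(served, monitor)  # ['click', 'view', 'click'] {'processed': 3}
```

The point of the sketch is that there is exactly one processing path: every event, old or new, goes through the same `streaming_layer`, and the monitoring counter closes the feedback loop.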
Execution Sample
Hadoop
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('KappaExample').getOrCreate()
# The Kafka source requires a broker address in addition to the topic subscription
streamingDF = (spark.readStream.format('kafka')
               .option('kafka.bootstrap.servers', 'localhost:9092')
               .option('subscribe', 'topic').load())
# Kafka delivers message values as raw bytes; cast them to strings
processedDF = streamingDF.selectExpr('CAST(value AS STRING)')
query = processedDF.writeStream.format('console').start()
query.awaitTermination()  # block until the stream is stopped
This code reads streaming data from Kafka, casts the message values to strings, and prints the results to the console in real time.
Execution Table
| Step | Action | Input Data | Processing | Output / State |
|------|--------|------------|------------|----------------|
| 1 | Start Spark session | None | Initialize Spark streaming context | Spark session ready |
| 2 | Read stream | Kafka topic messages | Connect to Kafka, read raw bytes | Streaming DataFrame with raw data |
| 3 | Select & cast | Raw bytes | Convert bytes to string | Streaming DataFrame with string values |
| 4 | Write stream | Processed data | Output to console sink | Streaming query started |
| 5 | Await termination | Streaming query active | Keep process running | Continuous output of streaming data |
| 6 | Stop | User interrupt | Stop streaming query | Streaming stopped |
💡 Streaming stops when the user interrupts or an error occurs; otherwise it runs continuously.
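Steps 5 and 6 can be illustrated with a small pure-Python simulation: the loop runs over an unbounded source until it is told to stop, which stands in for `awaitTermination` blocking until a user interrupt. The `run_stream` function and `stop_after` parameter are hypothetical, used only to make the stopping condition explicit.

```python
import itertools

def run_stream(events, stop_after=None):
    """Process an (possibly unbounded) event stream until interrupted."""
    processed = []
    for i, event in enumerate(events):
        if stop_after is not None and i >= stop_after:
            break  # stands in for query.stop() on user interrupt (step 6)
        processed.append(event.upper())
    return processed

# itertools.cycle never ends, like a live stream; only the stop condition ends it
out = run_stream(itertools.cycle(["a", "b"]), stop_after=4)
print(out)  # ['A', 'B', 'A', 'B']
```

Without the stop condition, the loop over `itertools.cycle` would never terminate, which mirrors why step 5 shows continuous output rather than a finished job.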
Variable Tracker
| Variable | Start | After Step 2 | After Step 3 | After Step 4 | Final |
|----------|-------|--------------|--------------|--------------|-------|
| spark | None | SparkSession object | SparkSession object | SparkSession object | SparkSession object |
| streamingDF | None | DataFrame with raw Kafka data | DataFrame with raw Kafka data | DataFrame with raw Kafka data | DataFrame with raw Kafka data |
| processedDF | None | None | DataFrame with string values | DataFrame with string values | DataFrame with string values |
| query | None | None | None | StreamingQuery object | StreamingQuery object (or stopped) |
Key Moments - 3 Insights
Why does the streaming query keep running instead of stopping after processing one batch?
Streaming queries in a Kappa architecture run continuously to process data in real time; step 5 of the execution table shows awaitTermination keeping the process alive.
What happens if new data arrives after the initial processing?
The streaming layer processes new data as it arrives and updates the output in real time, as shown by the feedback loop in the concept flow and the continuous output in step 5 of the execution table.
Why is there only one processing layer instead of separate batch and streaming layers?
Kappa architecture uses a single streaming layer for simplicity and real-time processing, unlike Lambda architecture, which maintains separate batch and streaming layers, as described in the concept flow.
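The last point can be made concrete with a hedged sketch: in Kappa there is no batch layer, so recomputation means replaying the retained event log through the same streaming code. The `log` list and `process` function below are illustrative stand-ins for a durable Kafka topic and the streaming job.

```python
# Durable event log (stands in for Kafka retention)
log = ["click", "view", "click", "view", "view"]

def process(stream):
    """One code path: count events as they stream through."""
    counts = {}
    for event in stream:
        counts[event] = counts.get(event, 0) + 1
    return counts

live = process(iter(log))      # normal streaming pass over arriving events
replayed = process(iter(log))  # "batch" recompute = replay the same log
print(live == replayed)        # True: one code path serves both needs
```

In a Lambda architecture, the batch recompute would be a second implementation that must be kept in sync with the streaming one; replaying through a single code path is what Kappa trades that duplication for.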
Visual Quiz - 3 Questions
Test your understanding
Looking at step 3 of the execution table, what transformation is applied to the input data?
A. Casting raw bytes to string
B. Filtering data by condition
C. Aggregating data by key
D. Writing data to storage
💡 Hint
Check the 'Processing' column in step 3 of the execution table.
At which step does the streaming query start writing output to the console?
A. Step 3
B. Step 4
C. Step 2
D. Step 5
💡 Hint
Look at the 'Action' and 'Output / State' columns of the execution table.
If the streaming query is stopped by the user, which step in the execution table corresponds to this?
A. Step 1
B. Step 4
C. Step 6
D. Step 5
💡 Hint
Refer to the 'Action' column describing stopping the streaming query.
Concept Snapshot
Kappa architecture uses a single streaming layer for real-time data processing.
Data flows continuously from source to output.
No separate batch layer is used.
Streaming queries run continuously until stopped.
Ideal for simple, real-time analytics pipelines.
Full Transcript
Kappa architecture focuses on processing data as a continuous stream. Data is ingested from sources like Kafka and processed in a single streaming layer. This layer transforms and outputs data in real time, without a separate batch layer. The streaming query runs continuously, processing new data as it arrives. The example code shows reading from Kafka, casting data to strings, and outputting to the console. The execution table traces each step from starting Spark to stopping the stream. Variables like streamingDF and query change state as the stream runs. Key moments clarify why streaming runs continuously and why only one processing layer is used. The visual quiz tests understanding of data transformation, output start, and stopping the stream. The snapshot summarizes the core idea: one streaming layer for real-time processing in Kappa architecture.