
Kappa architecture (streaming only) in Hadoop

Introduction

Kappa architecture processes all incoming data as a single continuous stream, keeping analysis fast and up to date.

Use it when:
- You want to analyze live data such as social media feeds or sensor readings.
- You need results to update instantly as new data arrives.
- Storing and reprocessing old data in a separate batch layer is unnecessary or too complex.
- You want a simpler system that handles only streaming data.
- You want to build real-time dashboards or alerts.
Syntax
There is no fixed code syntax: Kappa is an architectural style, not a programming language.

Typical components include:
- Data source (streaming data input)
- Stream processing engine (e.g., Apache Kafka, Apache Flink)
- Serving layer (to store processed results)
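These three components can be mimicked in plain Python to show the shape of the flow. Everything here is an illustrative stand-in, not a real streaming engine:

```python
# Minimal stand-ins for the three Kappa components.

def data_source():
    """Stream of incoming events (stand-in for a Kafka topic)."""
    yield {"sensor": "a", "temp": 21}
    yield {"sensor": "b", "temp": 19}
    yield {"sensor": "a", "temp": 23}

def stream_processor(events):
    """Per-event transformation (stand-in for Flink or Spark Streaming)."""
    for e in events:
        yield e["sensor"], e["temp"]

serving_layer = {}  # stand-in for Cassandra: latest value per sensor

# Data flows once: source -> processor -> serving layer.
for sensor, temp in stream_processor(data_source()):
    serving_layer[sensor] = temp

print(serving_layer)  # {'a': 23, 'b': 19}
```

Note that each event is handled as it arrives; the serving layer always holds the latest processed result.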

Data flows only once through the stream processor.

Kappa architecture uses only one data pipeline for streaming data.

It avoids a separate batch layer entirely; if historical results must be recomputed, the stream is simply replayed through the processor.
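The replay idea can be sketched in plain Python. Here a list stands in for a replayable Kafka topic, and the helper names are illustrative only:

```python
# Toy sketch of Kappa-style reprocessing: the immutable event log is the
# source of truth, and any serving view is rebuilt by replaying it.
event_log = [
    {"sensor": "a", "temp": 20},
    {"sensor": "a", "temp": 22},
    {"sensor": "b", "temp": 18},
]

def build_view(events, transform):
    """Replay every event through `transform` to rebuild the serving view."""
    view = {}
    for e in events:
        key, value = transform(e)
        view[key] = value
    return view

# Version 1 of the processing logic: latest temperature per sensor (Celsius).
v1 = build_view(event_log, lambda e: (e["sensor"], e["temp"]))

# Logic changed? No batch job needed: replay the same log through version 2,
# here converting to Fahrenheit.
v2 = build_view(event_log, lambda e: (e["sensor"], e["temp"] * 9 / 5 + 32))

print(v1)  # {'a': 22, 'b': 18}
print(v2)  # {'a': 71.6, 'b': 64.4}
```

This is the key contrast with Lambda architecture: instead of maintaining separate batch and speed layers, one pipeline is rerun over the retained stream.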

Examples
This shows a simple Kappa pipeline where events flow from Kafka into Flink for processing, and the results go to Cassandra for queries.

Data source (Kafka) -> Stream processor (Flink) -> Serving database (Cassandra)

Here, sensor data is processed live and shown immediately on a dashboard.

Data source (IoT sensors) -> Stream processor (Spark Streaming) -> Real-time dashboard
Sample Program

This example shows a streaming job reading data from Kafka, counting messages per minute, and printing results live.

Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

# Create Spark session
spark = SparkSession.builder.appName('KappaExample').getOrCreate()

# Read streaming data from Kafka
# (requires the spark-sql-kafka-0-10 connector package on the classpath)
streaming_df = spark.readStream.format('kafka') \
  .option('kafka.bootstrap.servers', 'localhost:9092') \
  .option('subscribe', 'sensor-data') \
  .load()

# Convert value to string
sensor_values = streaming_df.selectExpr('CAST(value AS STRING) as value', 'timestamp')

# Simple processing: count messages per 1 minute window
counts = sensor_values.groupBy(window(col('timestamp'), '1 minute')).count()

# Output to console for demo
query = counts.writeStream.outputMode('complete').format('console').start()

query.awaitTermination(10)  # Block for up to 10 seconds
query.stop()

spark.stop()
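To see what the windowed count computes without running a Kafka broker, here is a plain-Python equivalent of the one-minute tumbling window. The timestamps are made-up illustrative data:

```python
from collections import Counter

# Arrival times of messages, as epoch seconds (illustrative data).
events = [5, 30, 59, 61, 110, 125]

# Tumbling 1-minute windows: each event falls into exactly one window,
# [0s, 60s), [60s, 120s), [120s, 180s), ...
counts = Counter(ts // 60 for ts in events)

for window_start, n in sorted(counts.items()):
    print(f"window [{window_start * 60}s, {(window_start + 1) * 60}s): {n} messages")
```

Three events land in the first window, two in the second, and one in the third, which is the same grouping the Spark job performs continuously as new messages arrive.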
Important Notes

Kappa architecture simplifies data pipelines by focusing only on streaming data.

It is easier to maintain than Lambda architecture because it avoids batch layers.

Reprocessing data means replaying the retained stream (for example, a Kafka topic) from the beginning through the updated processing logic.

Summary

Kappa architecture processes all data through a single streaming pipeline.

It is useful for real-time data and simpler system design.

Common tools include Kafka for streaming and Spark or Flink for processing.