Kafka · devops · ~5 mins

Why Stream Processing Transforms Data in Kafka

Introduction
Data often arrives continuously and must be reshaped or enriched the moment it lands to be useful. Stream processing transforms this flowing data in real time so systems can react quickly and keep information fresh. Typical situations include:
When you want to filter out unwanted data from a live feed before saving it
When you need to enrich incoming data by adding extra details on the fly
When you want to aggregate or summarize data continuously as it arrives
When you must detect patterns or anomalies in data streams instantly
When you want to route data to different systems based on its content
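As a minimal sketch, the transformations above can be simulated on an in-memory list of events. The field names and helpers (`is_human`, `enrich`, `route`) are made up for illustration and are not Kafka APIs:

```python
# Illustrative only: simulating stream transformations on in-memory events.
events = [
    {"user": "alice", "action": "click"},
    {"user": "bot-1", "action": "click"},   # unwanted: filtered out below
    {"user": "bob", "action": "purchase"},
]

def is_human(event):
    # Filter: drop records from bot accounts before they are stored
    return not event["user"].startswith("bot-")

def enrich(event):
    # Enrich: attach extra details on the fly
    return {**event, "region": "us-east"}

def route(event):
    # Route: pick a destination topic based on the record's content
    return "purchases" if event["action"] == "purchase" else "activity"

for event in filter(is_human, events):
    record = enrich(event)
    print(route(record), record)
```

Each record flows through filter, enrich, and route steps one at a time, which is exactly the shape a real stream processor gives you.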
Config File - stream-processing.properties
bootstrap.servers=localhost:9092
application.id=stream-transform-app
processing.guarantee=exactly_once
cache.max.bytes.buffering=10485760
commit.interval.ms=1000

bootstrap.servers: Kafka server address to connect to.

application.id: Unique ID for this stream processing app.

processing.guarantee: Ensures each record is processed exactly once, avoiding duplicates (newer Kafka versions name this setting exactly_once_v2).

cache.max.bytes.buffering: Maximum memory (10 MB here) used to buffer records before they are flushed downstream.

commit.interval.ms: How often to save progress to Kafka.
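A .properties file is just plain key=value lines, which is worth seeing concretely. Below is a hedged sketch of parsing it; `load_properties` is an illustrative helper, not part of any Kafka library:

```python
# Sketch: parsing Java-style .properties key=value lines.
def load_properties(text):
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith(("#", "!")):  # skip comments
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

config = load_properties("""\
bootstrap.servers=localhost:9092
application.id=stream-transform-app
processing.guarantee=exactly_once
cache.max.bytes.buffering=10485760
commit.interval.ms=1000
""")
print(config["application.id"])  # stream-transform-app
```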

Commands
Create a Kafka topic named 'raw-data' where unprocessed data will be sent.
Terminal
kafka-topics --create --topic raw-data --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
Expected Output
Created topic raw-data.
--partitions - Number of partitions for parallel processing
--replication-factor - Number of copies for fault tolerance
Create a Kafka topic named 'transformed-data' to hold the processed and transformed data.
Terminal
kafka-topics --create --topic transformed-data --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
Expected Output
Created topic transformed-data.
--partitions - Number of partitions for parallel processing
--replication-factor - Number of copies for fault tolerance
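With 3 partitions on each topic, records that carry a key are spread across partitions by hashing the key, which is what makes parallel processing possible. A rough sketch of the idea — real Kafka's default partitioner uses murmur2 over the key bytes; md5 here is only for illustration:

```python
import hashlib

NUM_PARTITIONS = 3  # matches --partitions 3 above

def partition_for(key):
    # Illustration only: Kafka's default partitioner uses murmur2, not md5.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Records with the same key always land on the same partition,
# which preserves per-key ordering.
print(partition_for("user-42") == partition_for("user-42"))  # True
```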
Send sample raw data messages to the 'raw-data' topic to simulate incoming data stream.
Terminal
kafka-console-producer --topic raw-data --bootstrap-server localhost:9092
Expected Output
An interactive prompt (>) opens; type each message and press Enter to send it (Ctrl+C to exit)
--topic - Specify the topic to send data to
Start the stream processing application that reads from 'raw-data', transforms the data, and writes to 'transformed-data'. Kafka ships no generic CLI for this step: a Kafka Streams app is an ordinary Java program, so you start it like any jar (the jar name below is an example).
Terminal
java -jar stream-transform-app.jar stream-processing.properties
Expected Output
Stream processing application started with application.id=stream-transform-app
stream-processing.properties - Configuration file passed to the stream app
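At its core, such an app is a read-transform-write loop. Here is a minimal in-memory simulation of that loop; the transform shown is a stand-in, since a real app might parse, enrich, or reshape each record:

```python
# In-memory simulation of the app's core loop: consume from the input
# topic, transform each record, produce to the output topic.
raw_data = ["message 1", "message 2", "message 3"]  # stands in for raw-data
transformed_data = []                               # stands in for transformed-data

def transform(record):
    # Stand-in transformation; real logic would parse/enrich/reshape
    return f"transformed {record}"

for record in raw_data:                            # consume
    transformed_data.append(transform(record))     # transform + produce

print(transformed_data)
# ['transformed message 1', 'transformed message 2', 'transformed message 3']
```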
Read and display the transformed data from the 'transformed-data' topic to verify the transformation.
Terminal
kafka-console-consumer --topic transformed-data --bootstrap-server localhost:9092 --from-beginning
Expected Output
transformed message 1
transformed message 2
transformed message 3
--from-beginning - Read all messages from the start of the topic
Key Concept

Stream processing transforms data immediately as it flows in, enabling real-time insights and actions.
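One consequence worth seeing concretely: aggregates stay current after every record instead of waiting for a batch job. A small sketch of a running count in plain Python (not a Kafka API):

```python
from collections import Counter

# Sketch: a continuously updated aggregate. After every record the
# count is already current -- no batch job needed.
counts = Counter()
for event in ["click", "view", "click", "click"]:
    counts[event] += 1  # aggregate updated the moment data arrives
    # downstream systems reading `counts` here would see fresh totals

print(counts["click"], counts["view"])  # 3 1
```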

Common Mistakes
Not creating the output topic before starting the stream processing app
The app will fail to write transformed data if the output topic does not exist.
Always create both input and output topics before running the stream processing application.
Sending data to the wrong topic
If raw data is sent to the transformed-data topic, the app will not process it correctly.
Send raw data only to the input topic configured for the stream processor.
Not setting processing.guarantee to exactly_once
Without exactly-once semantics, records may be reprocessed and duplicated after a failure.
Set processing.guarantee=exactly_once in the config to ensure reliable processing.
Summary
Create Kafka topics for raw input data and transformed output data.
Send raw data to the input topic to simulate a live data stream.
Run a stream processing app that reads raw data, transforms it, and writes to the output topic.
Use a consumer to verify the transformed data is correctly produced.