Kafka · devops · ~5 mins

Why Stream Processing Transforms Data in Kafka

Introduction
Data often arrives continuously and must be reshaped or enriched the moment it lands to be useful. Stream processing transforms this flowing data in real time so systems can react quickly and keep information fresh. Typical situations include:
When you want to filter out unwanted data from a live feed before saving it
When you need to enrich incoming data by adding extra details on the fly
When you want to aggregate or summarize data continuously as it arrives
When you must detect patterns or anomalies in data streams instantly
When you want to route data to different systems based on its content
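As a minimal sketch, the transformations above can be simulated on an in-memory list of events. The field names and helpers (`is_human`, `enrich`, `route`) are made up for illustration and are not Kafka APIs:

```python
# Illustrative only: simulating stream transformations on in-memory events.
events = [
    {"user": "alice", "action": "click"},
    {"user": "bot-1", "action": "click"},   # unwanted: filtered out below
    {"user": "bob", "action": "purchase"},
]

def is_human(event):
    # Filter: drop records from bot accounts before they are stored
    return not event["user"].startswith("bot-")

def enrich(event):
    # Enrich: attach extra details on the fly
    return {**event, "region": "us-east"}

def route(event):
    # Route: pick a destination topic based on the record's content
    return "purchases" if event["action"] == "purchase" else "activity"

for event in filter(is_human, events):
    record = enrich(event)
    print(route(record), record)
```

Each record flows through filter, enrich, and route steps one at a time, which is exactly the shape a real stream processor gives you.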
Config File - stream-processing.properties
bootstrap.servers=localhost:9092
application.id=stream-transform-app
processing.guarantee=exactly_once
cache.max.bytes.buffering=10485760
commit.interval.ms=1000

bootstrap.servers: Kafka server address to connect to.

application.id: Unique ID for this stream processing app.

processing.guarantee: Ensures each record is processed exactly once, avoiding duplicates (newer Kafka versions name this setting exactly_once_v2).

cache.max.bytes.buffering: Maximum memory (10 MB here) used to buffer records before they are flushed downstream.

commit.interval.ms: How often to save progress to Kafka.
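A .properties file is just plain key=value lines, which is worth seeing concretely. Below is a hedged sketch of parsing it; `load_properties` is an illustrative helper, not part of any Kafka library:

```python
# Sketch: parsing Java-style .properties key=value lines.
def load_properties(text):
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith(("#", "!")):  # skip comments
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

config = load_properties("""\
bootstrap.servers=localhost:9092
application.id=stream-transform-app
processing.guarantee=exactly_once
cache.max.bytes.buffering=10485760
commit.interval.ms=1000
""")
print(config["application.id"])  # stream-transform-app
```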

Commands
Create a Kafka topic named 'raw-data' where unprocessed data will be sent.
Terminal
kafka-topics --create --topic raw-data --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
Expected Output
Created topic raw-data.
--partitions - Number of partitions for parallel processing
--replication-factor - Number of copies for fault tolerance
Create a Kafka topic named 'transformed-data' to hold the processed and transformed data.
Terminal
kafka-topics --create --topic transformed-data --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
Expected Output
Created topic transformed-data.
--partitions - Number of partitions for parallel processing
--replication-factor - Number of copies for fault tolerance
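With 3 partitions on each topic, records that carry a key are spread across partitions by hashing the key, which is what makes parallel processing possible. A rough sketch of the idea — real Kafka's default partitioner uses murmur2 over the key bytes; md5 here is only for illustration:

```python
import hashlib

NUM_PARTITIONS = 3  # matches --partitions 3 above

def partition_for(key):
    # Illustration only: Kafka's default partitioner uses murmur2, not md5.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Records with the same key always land on the same partition,
# which preserves per-key ordering.
print(partition_for("user-42") == partition_for("user-42"))  # True
```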
Send sample raw data messages to the 'raw-data' topic to simulate incoming data stream.
Terminal
kafka-console-producer --topic raw-data --bootstrap-server localhost:9092
Expected Output
An interactive prompt (>) opens; type each message and press Enter to send it (Ctrl+C to exit)
--topic - Specify the topic to send data to
Start the stream processing application that reads from 'raw-data', transforms the data, and writes to 'transformed-data'. Kafka ships no generic CLI for this step: a Kafka Streams app is an ordinary Java program, so you start it like any jar (the jar name below is an example).
Terminal
java -jar stream-transform-app.jar stream-processing.properties
Expected Output
Stream processing application started with application.id=stream-transform-app
stream-processing.properties - Configuration file passed to the stream app
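At its core, such an app is a read-transform-write loop. Here is a minimal in-memory simulation of that loop; the transform shown is a stand-in, since a real app might parse, enrich, or reshape each record:

```python
# In-memory simulation of the app's core loop: consume from the input
# topic, transform each record, produce to the output topic.
raw_data = ["message 1", "message 2", "message 3"]  # stands in for raw-data
transformed_data = []                               # stands in for transformed-data

def transform(record):
    # Stand-in transformation; real logic would parse/enrich/reshape
    return f"transformed {record}"

for record in raw_data:                            # consume
    transformed_data.append(transform(record))     # transform + produce

print(transformed_data)
# ['transformed message 1', 'transformed message 2', 'transformed message 3']
```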
Read and display the transformed data from the 'transformed-data' topic to verify the transformation.
Terminal
kafka-console-consumer --topic transformed-data --bootstrap-server localhost:9092 --from-beginning
Expected Output
transformed message 1
transformed message 2
transformed message 3
--from-beginning - Read all messages from the start of the topic
Key Concept

Stream processing transforms data immediately as it flows in, enabling real-time insights and actions.
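One consequence worth seeing concretely: aggregates stay current after every record instead of waiting for a batch job. A small sketch of a running count in plain Python (not a Kafka API):

```python
from collections import Counter

# Sketch: a continuously updated aggregate. After every record the
# count is already current -- no batch job needed.
counts = Counter()
for event in ["click", "view", "click", "click"]:
    counts[event] += 1  # aggregate updated the moment data arrives
    # downstream systems reading `counts` here would see fresh totals

print(counts["click"], counts["view"])  # 3 1
```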

Common Mistakes
Not creating the output topic before starting the stream processing app
The app will fail to write transformed data if the output topic does not exist.
Always create both input and output topics before running the stream processing application.
Sending data to the wrong topic
If raw data is sent to the transformed-data topic, the app will not process it correctly.
Send raw data only to the input topic configured for the stream processor.
Not setting processing.guarantee to exactly_once
Without exactly-once semantics, records may be reprocessed and duplicated after a failure.
Set processing.guarantee=exactly_once in the config to ensure reliable processing.
Summary
Create Kafka topics for raw input data and transformed output data.
Send raw data to the input topic to simulate a live data stream.
Run a stream processing app that reads raw data, transforms it, and writes to the output topic.
Use a consumer to verify the transformed data is correctly produced.