
GroupBy and aggregation in Kafka - Commands & Configuration

Introduction
When you have many messages flowing through Kafka, you often want to group them by a key and calculate summaries like counts or sums. GroupBy and aggregation help you organize and analyze streaming data in real time.
When you want to count how many times each user sends messages in a stream.
When you need to sum sales amounts grouped by product category as data flows in.
When you want to find the average temperature per city from sensor data streaming through Kafka.
When you want to track the number of events per minute grouped by event type.
When you want to create a real-time leaderboard by grouping scores by player.
Config File - kafka-streams-aggregation.properties
application.id=groupby-aggregation-app
bootstrap.servers=localhost:9092
cache.max.bytes.buffering=0
processing.guarantee=at_least_once
default.key.serde=org.apache.kafka.common.serialization.Serdes$StringSerde
default.value.serde=org.apache.kafka.common.serialization.Serdes$LongSerde

This configuration file gives the Kafka Streams application a unique application ID and connects it to the Kafka broker at localhost:9092. It disables record caching so that every aggregation update is emitted immediately, keeps the default at-least-once processing guarantee, and sets the default serdes to strings for keys and long integers for values, which suits counting and summing.
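As a rough illustration of how an application might read these settings, the sketch below parses them with java.util.Properties. The class name StreamsConfigSketch and the inline copy of the file are assumptions made for this example; a real application would more likely read the .properties file from disk and pass the resulting Properties object to the KafkaStreams constructor.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of the tutorial's application.
class StreamsConfigSketch {
    // Inline copy of the settings above, used here instead of reading a file.
    static final String CONFIG = String.join("\n",
        "application.id=groupby-aggregation-app",
        "bootstrap.servers=localhost:9092",
        "cache.max.bytes.buffering=0",
        "processing.guarantee=at_least_once",
        "default.key.serde=org.apache.kafka.common.serialization.Serdes$StringSerde",
        "default.value.serde=org.apache.kafka.common.serialization.Serdes$LongSerde");

    // Parses key=value lines into a Properties object.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load(CONFIG);
        // Print each setting so the parsed values can be checked by eye.
        props.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The $ in the serde class names is how Java names a nested class (StringSerde inside Serdes), so it must be kept exactly as written in the file.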

Commands
Create a Kafka topic named 'sales' with 3 partitions to distribute data and a replication factor of 1. A single copy provides no fault tolerance, which is acceptable for a local single-broker setup.
Terminal
kafka-topics --create --topic sales --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
Expected Output
Created topic sales.
--partitions - Number of partitions for parallelism
--replication-factor - Number of copies of each partition (1 means no redundancy)
Start a producer to send sales data messages to the 'sales' topic.
Terminal
kafka-console-producer --topic sales --bootstrap-server localhost:9092
Expected Output
No output is printed. The producer shows a > prompt and waits for you to type messages, one per line.
--topic - Topic to send messages to
--bootstrap-server - Kafka server address
Run the Kafka Streams application that groups sales by product and sums the amounts in real time.
Terminal
java -jar kafka-streams-aggregation.jar
Expected Output
Starting Kafka Streams application...
Topology built successfully.
Streams started.
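The source of kafka-streams-aggregation.jar is not shown, so the following is only a plain-Java model of what a group-by-key sum computes, not the application itself. It keeps a running total per product and emits the updated total after every record, which is why a key such as productA can appear more than once in the output topic. The class name RunningSumSketch and the input amounts are assumptions chosen for illustration. In the Kafka Streams DSL, the corresponding step would typically be written as groupByKey() followed by reduce(Long::sum).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a per-key running sum; it does not use Kafka.
class RunningSumSketch {
    // Applies each (product, amount) record to a per-key total and returns
    // the updated total emitted after every record, in arrival order.
    static List<String> aggregate(List<Map.Entry<String, Long>> records) {
        Map<String, Long> totals = new HashMap<>(); // stands in for the state store
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, Long> record : records) {
            long updated = totals.merge(record.getKey(), record.getValue(), Long::sum);
            emitted.add(record.getKey() + " " + updated);
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Assumed input: two sales of productA and one of productB.
        List<Map.Entry<String, Long>> sales = List.of(
            Map.entry("productA", 150L),
            Map.entry("productB", 200L),
            Map.entry("productA", 30L));
        // Prints three lines: productA 150, productB 200, productA 180
        aggregate(sales).forEach(System.out::println);
    }
}
```

The second productA line replaces the first: the output topic carries a stream of updates, and the latest record for a key is its current total.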
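Consume and display the aggregated sales results from the 'aggregated-sales' topic to verify the groupBy and sum operation. The console consumer prints only values by default and decodes them as strings, so the command enables key printing and sets a Long deserializer for the values.
Terminal
kafka-console-consumer --topic aggregated-sales --bootstrap-server localhost:9092 --from-beginning --property print.key=true --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
Expected Output
productA 150
productB 200
productA 180
--from-beginning - Read all messages from the start
--property print.key=true - Print each record's key next to its value
--property value.deserializer - Deserializer class used to decode values (LongDeserializer for long sums)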
Key Concept

If you remember nothing else from this pattern, remember: grouping messages by key and applying aggregation functions lets you summarize streaming data in real time.

Common Mistakes
Not setting the correct key serializer and deserializer in the configuration.
Kafka Streams cannot group messages properly without correct key serialization, causing runtime errors or wrong grouping.
Always configure key serde (serializer/deserializer) to match the key data type, like StringSerde for string keys.
Forgetting to create the output topic before running the streams application.
The application fails to write aggregated results if the output topic does not exist and the broker is not configured to auto-create topics.
Create all input and output topics (here 'sales' and 'aggregated-sales') with appropriate partitions before starting the streams app.
Leaving caching enabled without realizing that it delays output.
Caching buffers results and delays aggregation output, confusing beginners expecting immediate results.
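Set cache.max.bytes.buffering=0 in the config to disable caching for learning and immediate output. Newer Kafka releases deprecate this setting in favor of statestore.cache.max.bytes, so check which name your version expects.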
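To make the effect of caching concrete, here is a simplified plain-Java model, not Kafka's actual cache implementation. With caching off, every update is forwarded as it happens. With caching on, updates for the same key overwrite each other and only the latest value per key is forwarded at flush time. The class name CacheEffectSketch, the single end-of-batch flush, and the flush ordering are assumptions of this model; in Kafka Streams, a flush is triggered by the commit interval or by the cache filling up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of record caching; it does not use Kafka.
class CacheEffectSketch {
    // Returns what would be forwarded downstream for a batch of per-key updates.
    // cacheEnabled=false: every update is forwarded immediately.
    // cacheEnabled=true: updates are held, and only the latest value per key
    // is forwarded when the batch ends (modelling a single flush).
    static List<String> forward(List<Map.Entry<String, Long>> updates, boolean cacheEnabled) {
        List<String> out = new ArrayList<>();
        Map<String, Long> cache = new LinkedHashMap<>();
        for (Map.Entry<String, Long> update : updates) {
            if (cacheEnabled) {
                cache.put(update.getKey(), update.getValue()); // overwrite older value for the key
            } else {
                out.add(update.getKey() + " " + update.getValue());
            }
        }
        cache.forEach((key, value) -> out.add(key + " " + value)); // flush (empty when cache is off)
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> updates = List.of(
            Map.entry("productA", 150L),
            Map.entry("productB", 200L),
            Map.entry("productA", 180L));
        System.out.println(forward(updates, false)); // [productA 150, productB 200, productA 180]
        System.out.println(forward(updates, true));  // [productA 180, productB 200]
    }
}
```

Both runs end with the same final totals; caching only reduces the number of intermediate updates that reach the output topic.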
Summary
Create Kafka topics to hold input and output data streams.
Configure Kafka Streams with proper serializers and disable caching for immediate aggregation.
Run a Kafka Streams app that groups messages by key and applies aggregation like sum or count.
Use console consumer to verify the aggregated results in the output topic.