Kafka · devops · ~15 mins

Stream topology in Kafka - Deep Dive

Overview - Stream topology
What is it?
Stream topology is the layout or design of how data flows and is processed in a Kafka Streams application. It shows the sequence of operations like filtering, mapping, joining, and aggregating data streams. Each step in the topology represents a processing node that transforms or routes data. This design helps organize complex data processing tasks into clear, manageable parts.
Why it matters
Without a defined stream topology, managing and understanding how data moves and changes in real time would be chaotic. Topology solves the problem of organizing continuous data processing so developers can build reliable, scalable, and maintainable streaming applications. Without it, debugging, scaling, or evolving streaming logic becomes very difficult, leading to errors and slow systems.
Where it fits
Before learning stream topology, you should understand Kafka basics like topics, producers, and consumers. After mastering topology, you can explore advanced Kafka Streams features like state stores, windowing, and fault tolerance. It fits in the journey between basic Kafka usage and building complex real-time data pipelines.
Mental Model
Core Idea
Stream topology is the blueprint that maps how data flows through processing steps in a Kafka Streams application.
Think of it like...
Imagine a factory assembly line where raw materials enter, pass through different machines that shape, inspect, and combine parts, and finally produce a finished product. Each machine is like a processing step in the stream topology, and the conveyor belts represent the data flow between them.
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Source Topic  │ --> │ Processor 1   │ --> │ Processor 2   │ --> ...
└───────────────┘     └───────────────┘     └───────────────┘
                             │                     │
                             ▼                     ▼
                      ┌───────────────┐     ┌───────────────┐
                      │ State Store   │     │ Sink Topic    │
                      └───────────────┘     └───────────────┘
Build-Up - 7 Steps
1
Foundation: Understanding Kafka Streams Basics
Concept: Introduce Kafka Streams as a library to process data streams in real-time.
Kafka Streams lets you read data from Kafka topics, process it with operations like filtering or mapping, and write results back to topics. It runs inside your application, making it easy to build streaming apps without separate clusters.
Result
You can create simple stream processing apps that consume and produce Kafka data.
Knowing Kafka Streams basics is essential because stream topology builds on how streams are created and processed.
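As a minimal sketch of such an app (the topic names, application id, and broker address are illustrative assumptions, not values from this text):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class UppercaseApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");      // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic"); // read from a topic
        source.mapValues(v -> v.toUpperCase())                          // transform each value
              .to("output-topic");                                      // write results back

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The whole pipeline runs inside this one JVM process; no separate processing cluster is involved.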
2
Foundation: What is a Stream Topology?
Concept: Define stream topology as the structure of processing steps in a Kafka Streams app.
A topology is a graph of nodes where each node is a processor that transforms or routes data. It starts from source topics, passes through processors, and ends at sink topics or state stores.
Result
You understand that stream topology is the map of your data processing flow.
Recognizing topology as a graph helps visualize and organize complex stream processing.
3
Intermediate: Building a Simple Topology Programmatically
🤔 Before reading on: do you think a topology is defined by configuration files or by code? Commit to your answer.
Concept: Learn how to create a topology using Kafka Streams API in code.
You use the StreamsBuilder class to define sources, processors, and sinks. For example, builder.stream("input-topic") creates a source node. Then you chain operations like filter() or map() to add processors. Finally, you write results to output topics.
Result
You can write code that builds a stream topology representing your data flow.
Understanding that topology is built in code clarifies how flexible and dynamic stream processing can be.
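A minimal sketch of this pattern (the topic names and class name are hypothetical):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class TopologyExample {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");        // source node
        orders.filter((key, value) -> value != null && !value.isEmpty()) // processor: drop empties
              .mapValues(String::trim)                                   // processor: normalize
              .to("clean-orders");                                       // sink node
        return builder.build(); // materialize the chained calls as a Topology graph
    }
}
```

Each chained call adds a node to the graph; `build()` turns the accumulated definition into a `Topology` you can hand to `KafkaStreams`.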
4
Intermediate: Topology Components: Sources, Processors, Sinks
🤔 Before reading on: do you think processors can exist without sources or sinks? Commit to your answer.
Concept: Explore the three main node types in a topology and their roles.
Source nodes read data from Kafka topics. Processor nodes transform or analyze data. Sink nodes write processed data back to Kafka topics. State stores can be attached to processors to keep data between events.
Result
You can identify and explain each component's role in a topology.
Knowing these components helps you design and debug stream applications effectively.
5
Intermediate: Visualizing and Inspecting Topologies
🤔 Before reading on: do you think Kafka Streams provides tools to see your topology? Commit to your answer.
Concept: Learn how to view and understand the topology your code creates.
Kafka Streams offers the describe() method on Topology objects; printing the returned TopologyDescription renders the topology graph. This helps verify the processing steps and connections. Visualizing the topology aids in debugging and optimization.
Result
You can print and interpret the topology structure from your application.
Seeing the topology output prevents errors and clarifies complex stream flows.
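For instance, a small topology can be built and printed like this (topic names are illustrative; the node names in the printed output are auto-generated IDs such as KSTREAM-SOURCE-0000000000, so exact values will vary):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

public class DescribeTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic")
               .filter((k, v) -> v != null)   // processor node
               .to("output-topic");           // sink node
        Topology topology = builder.build();
        // describe() returns a TopologyDescription; printing it renders the
        // graph as sub-topologies with their source, processor, and sink nodes.
        System.out.println(topology.describe());
    }
}
```

The printed description groups nodes into sub-topologies and shows the edges between them, which makes it easy to confirm the code built the flow you intended.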
6
Advanced: State Stores and Their Role in Topology
🤔 Before reading on: do you think state stores are optional or mandatory in all topologies? Commit to your answer.
Concept: Understand how state stores add memory to stream processing nodes.
State stores keep data locally for processors to use across multiple events, enabling operations like joins and aggregations. They are part of the topology and must be defined and connected properly.
Result
You can design topologies that maintain state for complex processing.
Knowing how state stores integrate into topology is key for building powerful, fault-tolerant streaming apps.
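A sketch of a stateful topology with an explicitly named state store (topic names, store name, and class name are hypothetical):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class PageViewCounts {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("page-views")
               .groupByKey()
               // count() is stateful: the named store holds one running count
               // per key and is backed by a changelog topic for fault tolerance.
               .count(Materialized.as("page-view-counts"))
               .toStream()
               .to("page-view-totals", Produced.with(Serdes.String(), Serdes.Long()));
        return builder.build();
    }
}
```

Naming the store via Materialized makes it part of the topology explicitly, so it can be queried interactively and located in the printed topology description.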
7
Expert: Topology Optimization and Internal Mechanics
🤔 Before reading on: do you think Kafka Streams rearranges your topology internally for performance? Commit to your answer.
Concept: Explore how Kafka Streams optimizes and manages topology execution under the hood.
Kafka Streams analyzes the topology graph to optimize task assignment, parallelism, and state management. It splits the topology into tasks that run on different threads or machines. Understanding this helps tune performance and troubleshoot issues.
Result
You grasp how your topology translates into running processes and how to optimize it.
Knowing internal optimization reveals why some topology designs perform better and how to scale streaming apps.
Under the Hood
Kafka Streams represents the stream topology as a directed acyclic graph where nodes are processors and edges are data flows. At runtime, this graph is divided into tasks assigned to threads. Each task processes data from partitions, manages local state stores, and commits results back to Kafka. The library handles fault tolerance by replaying data from Kafka and restoring state stores.
Why designed this way?
This design allows Kafka Streams to be scalable, fault-tolerant, and easy to use. Using a graph model matches the natural flow of data processing. Dividing work into tasks enables parallelism and efficient resource use. Alternatives like centralized processing or manual partition management were more complex and less resilient.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Source Node   │──────▶│ Processor Node│──────▶│ Sink Node     │
└───────────────┘       └───────────────┘       └───────────────┘
        │                      │                       │
        ▼                      ▼                       ▼
  ┌───────────────┐      ┌───────────────┐       ┌───────────────┐
  │ Kafka Topic   │      │ State Store   │       │ Kafka Topic   │
  └───────────────┘      └───────────────┘       └───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Is a stream topology a static configuration file or dynamic code structure? Commit to your answer.
Common Belief: Stream topology is a fixed configuration file you write once and never change.
Reality: Stream topology is defined dynamically in code using the Kafka Streams API, allowing flexible and programmable data flows.
Why it matters: Believing topology is static limits understanding of how to build adaptable streaming apps and hinders debugging.
Quick: Do you think processors can run independently without source nodes? Commit to your answer.
Common Belief: Processors can exist and run without any source nodes feeding data.
Reality: Processors depend on source nodes to receive data; without sources, processors have no input to work on.
Why it matters: Misunderstanding this leads to designing incomplete topologies that never process data.
Quick: Do you think state stores are always required in Kafka Streams topologies? Commit to your answer.
Common Belief: Every Kafka Streams topology must include state stores to function.
Reality: State stores are optional and only needed for stateful operations like joins or aggregations; stateless processing does not require them.
Why it matters: Assuming state stores are mandatory can complicate simple stream designs and waste resources.
Quick: Do you think Kafka Streams executes your topology exactly as you wrote it without changes? Commit to your answer.
Common Belief: Kafka Streams runs the topology exactly as defined in code without any internal changes.
Reality: Kafka Streams optimizes and rearranges the topology internally for performance and fault tolerance.
Why it matters: Ignoring internal optimizations can cause confusion when debugging or tuning performance.
Expert Zone
1
Topology nodes can be reused internally by Kafka Streams to optimize resource use, which can affect how metrics and logs appear.
2
The order of processor nodes matters because Kafka Streams processes data in the defined sequence, impacting latency and correctness.
3
State stores are backed by changelog topics in Kafka, enabling fault tolerance by replaying state changes after failures.
When NOT to use
Stream topology is not suitable for batch processing or very low-latency microsecond-level processing. For batch jobs, use tools like Apache Spark. For ultra-low latency, consider specialized stream processors or in-memory databases.
Production Patterns
In production, topologies are designed modularly with reusable sub-topologies for common tasks. They use state stores for joins and aggregations, and leverage Kafka Streams' fault tolerance by carefully managing changelog topics and commit intervals.
Connections
Dataflow Programming
Stream topology is a specific example of dataflow programming where computation is modeled as a graph of operations.
Understanding stream topology deepens comprehension of dataflow concepts used in many real-time and parallel processing systems.
Factory Assembly Lines
Both organize sequential steps to transform inputs into outputs efficiently.
Recognizing this connection helps appreciate how stream topology structures complex data transformations clearly and reliably.
Neural Networks
Like stream topologies, neural networks are graphs of nodes processing data in layers.
Seeing this similarity reveals how graph-based processing models are powerful across domains from data engineering to AI.
Common Pitfalls
#1: Defining topology without connecting source nodes.
Wrong approach: StreamsBuilder builder = new StreamsBuilder(); KStream stream = builder.stream(); // missing topic name; there is no zero-argument stream() overload, so no source node is ever created. stream.filter(...).to("output-topic");
Correct approach: StreamsBuilder builder = new StreamsBuilder(); KStream stream = builder.stream("input-topic"); stream.filter(...).to("output-topic");
Root cause: Not specifying source topics means no data enters the topology, so processing never happens.
#2: Assuming state stores are automatically created for all operations.
Wrong approach: builder.stream("input").map(...).to("output"); // expecting a state store for map
Correct approach: builder.stream("input").groupByKey().aggregate(...); // stateful operations like aggregate create state stores explicitly
Root cause: State stores only exist for stateful operations; stateless ones like map do not create them.
#3: Ignoring topology visualization and debugging tools.
Wrong approach: Never printing or inspecting the topology structure, leading to confusion about data flow.
Correct approach: System.out.println(builder.build().describe()); // prints the topology graph
Root cause: Skipping visualization makes it hard to verify and understand complex topologies.
Key Takeaways
Stream topology is the map of how data flows and is processed in Kafka Streams applications.
It is defined dynamically in code using Kafka Streams API, not static configuration files.
Topology consists of source nodes, processor nodes, sink nodes, and optionally state stores for stateful processing.
Kafka Streams internally optimizes and manages topology execution for scalability and fault tolerance.
Understanding topology helps build, debug, and optimize real-time streaming data pipelines effectively.