Apache Spark · ~10 mins

Streaming joins in Apache Spark - Step-by-Step Execution

Concept Flow - Streaming joins
Start → Streaming DataFrames → Define Join Condition → Apply Streaming Join → Process Joined Stream → Output Results Continuously → End
Streaming joins combine two continuous data streams based on a condition, producing joined output as new data arrives.
Execution Sample
Apache Spark
# Each socket line arrives as a row with a single string column named "value"
stream1 = spark.readStream.format("socket").option("host", "localhost").option("port", 9998).load()
stream2 = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
# Inner equi-join: a row is emitted only when the same value appears in both streams
joined = stream1.join(stream2, stream1.value == stream2.value)
# Print each micro-batch of joined rows to the console
query = joined.writeStream.format("console").start()
query.awaitTermination()  # block so the streaming query keeps running
This code reads two streaming sources from sockets, joins them on matching values, and outputs the joined data to the console continuously.
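The per-batch behavior of this inner equi-join can be sketched in plain Python. This is a simplified model of a single micro-batch that ignores Spark's state store and schemas (a real stream-stream join also buffers earlier rows, as noted further down); `micro_batch_join` is an illustrative helper, not a Spark API:

```python
def micro_batch_join(batch1, batch2):
    """Inner equi-join on the single 'value' field of two micro-batches."""
    # Emit a joined row for each value present in both batches
    # (set intersection, so duplicates within a batch are collapsed)
    matches = set(batch1) & set(batch2)
    return sorted(matches)

# The two batches from the execution table
print(micro_batch_join(["apple", "banana"], ["banana", "cherry"]))  # ['banana']
print(micro_batch_join(["date"], ["date", "apple"]))                # ['date']
```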
Execution Table
| Step | Action | Input Data | Join Condition | Joined Output |
|---|---|---|---|---|
| 1 | Read first batch from stream1 | ["apple", "banana"] | N/A | N/A |
| 2 | Read first batch from stream2 | ["banana", "cherry"] | N/A | N/A |
| 3 | Apply join on stream1.value == stream2.value | stream1: ["apple", "banana"]; stream2: ["banana", "cherry"] | value equality | ["banana"] |
| 4 | Output joined data to console | ["banana"] | value equality | ["banana"] |
| 5 | Read second batch from stream1 | ["date"] | N/A | N/A |
| 6 | Read second batch from stream2 | ["date", "apple"] | N/A | N/A |
| 7 | Apply join on new batches | stream1: ["date"]; stream2: ["date", "apple"] | value equality | ["date"] |
| 8 | Output joined data to console | ["date"] | value equality | ["date"] |
| 9 | No more data; streaming continues waiting | N/A | N/A | N/A |
💡 Streaming continues indefinitely; only the first two micro-batches are shown here.
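One caveat: a real Spark stream-stream join buffers each side's past rows in state, so matches can span micro-batches; the table above shows a simplified within-batch view. Under Spark's actual semantics, step 7 would also emit "apple", since "apple" arrived on stream1 in batch 1 and on stream2 in batch 2. A minimal pure-Python sketch of that stateful behavior (`StatefulJoin` is an illustrative class, not a Spark API):

```python
class StatefulJoin:
    """Toy model of an unbounded stream-stream inner equi-join on 'value'."""

    def __init__(self):
        self.left_state = []   # all rows seen so far on stream1
        self.right_state = []  # all rows seen so far on stream2

    def process_batch(self, left_batch, right_batch):
        # New left rows match everything buffered (or arriving now) on the right
        out = [v for v in left_batch if v in self.right_state + right_batch]
        # New right rows match previously buffered left rows
        out += [v for v in right_batch if v in self.left_state]
        self.left_state += left_batch
        self.right_state += right_batch
        return sorted(out)

j = StatefulJoin()
print(j.process_batch(["apple", "banana"], ["banana", "cherry"]))  # ['banana']
print(j.process_batch(["date"], ["date", "apple"]))                # ['apple', 'date']
```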
Variable Tracker
| Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 5 | After Step 6 | After Step 7 | Final |
|---|---|---|---|---|---|---|---|---|
| stream1_data | [] | ["apple", "banana"] | ["apple", "banana"] | ["apple", "banana"] | ["date"] | ["date"] | ["date"] | ["date"] |
| stream2_data | [] | [] | ["banana", "cherry"] | ["banana", "cherry"] | ["banana", "cherry"] | ["date", "apple"] | ["date", "apple"] | ["date", "apple"] |
| joined_output | [] | [] | [] | ["banana"] | ["banana"] | ["banana"] | ["date"] | ["date"] |
Key Moments - 2 Insights
Why does the join output only show matching values from both streams?
Because the join condition requires a value to be present in both streams, only matching values appear in the joined output, as shown in execution table steps 3 and 7.
What happens if one stream has data but the other does not?
No joined output is produced until matching data arrives in both streams; in steps where one stream has new data but the other has no match, no join output is emitted.
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table at step 3. What is the joined output?
A. ["banana"]
B. ["apple"]
C. ["cherry"]
D. []
💡 Hint
Check the 'Joined Output' column at step 3 in the execution table.
At which step does the join output include the value "date"?
A. Step 5
B. Step 3
C. Step 7
D. Step 9
💡 Hint
Look for the value "date" in the 'Joined Output' column of the execution table.
If stream2 never sends "banana", what happens to the joined output at step 3?
A. It outputs ["banana"] anyway
B. It outputs an empty list []
C. It outputs all values from stream1
D. It stops the stream
💡 Hint
Refer to the join condition and output behavior in execution table rows 3 and 4.
Concept Snapshot
Streaming joins combine two live data streams by matching rows based on a condition.
They continuously output joined results as new data arrives.
Use Spark's readStream to read, join() to combine, and writeStream to output.
Only matching data from both streams appears in output.
Streaming joins run indefinitely until stopped.
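Because an indefinitely running join must buffer every row it might still match, its state grows without bound unless old rows are evicted. Spark bounds this state with event-time watermarks (`withWatermark` plus a time-range join condition). The eviction idea can be sketched in plain Python; the timestamps and the 10-second delay threshold here are illustrative assumptions, not Spark defaults:

```python
def evict_expired(state, watermark):
    """Drop buffered rows whose event time is at or before the watermark."""
    return [(value, ts) for (value, ts) in state if ts > watermark]

# Buffered (value, event_time_seconds) rows on one side of the join
state = [("apple", 5), ("banana", 12), ("date", 20)]

# Watermark = max event time seen (20) minus a 10-second delay threshold
watermark = 20 - 10
print(evict_expired(state, watermark))  # [('banana', 12), ('date', 20)]
```

Rows older than the watermark can no longer match future input, so dropping them keeps state bounded without changing the join's results.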
Full Transcript
Streaming joins in Apache Spark combine two continuous data streams based on a join condition, such as matching values. The process starts by reading streaming data from two sources. Then, a join condition is defined, for example, matching the 'value' field in both streams. The join is applied continuously as new data arrives. Joined results are output continuously, for example, to the console. The execution table shows step-by-step how batches from each stream are read, joined, and output. Variables track the data in each stream and the joined output after each step. Key moments clarify why only matching values appear in the output and what happens if one stream lacks matching data. The visual quiz tests understanding of the joined output at specific steps and the effect of missing data. The snapshot summarizes the key points: streaming joins combine live data streams, output matching rows continuously, and require both streams to have matching data to produce output.