Apache Spark · ~10 mins

Streaming joins in Apache Spark - Step-by-Step Execution

Concept Flow - Streaming joins
Start → Streaming DataFrames → Define Join Condition → Apply Streaming Join → Process Joined Stream → Output Results Continuously → End
Streaming joins combine two continuous data streams based on a condition, producing joined output as new data arrives.
Execution Sample
Apache Spark
# Each socket line arrives as a row with a single string column named "value"
stream1 = spark.readStream.format("socket").option("host", "localhost").option("port", 9998).load()
stream2 = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
# Inner equi-join: a row is emitted only when the same value appears in both streams
joined = stream1.join(stream2, stream1.value == stream2.value)
# Print each micro-batch of joined rows to the console
query = joined.writeStream.format("console").start()
query.awaitTermination()  # block so the streaming query keeps running
This code reads two streaming sources from sockets, joins them on matching values, and outputs the joined data to the console continuously.
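The per-batch behavior of this inner equi-join can be sketched in plain Python. This is a simplified model of a single micro-batch that ignores Spark's state store and schemas (a real stream-stream join also buffers earlier rows, as noted further down); `micro_batch_join` is an illustrative helper, not a Spark API:

```python
def micro_batch_join(batch1, batch2):
    """Inner equi-join on the single 'value' field of two micro-batches."""
    # Emit a joined row for each value present in both batches
    # (set intersection, so duplicates within a batch are collapsed)
    matches = set(batch1) & set(batch2)
    return sorted(matches)

# The two batches from the execution table
print(micro_batch_join(["apple", "banana"], ["banana", "cherry"]))  # ['banana']
print(micro_batch_join(["date"], ["date", "apple"]))                # ['date']
```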
Execution Table
| Step | Action | Input Data | Join Condition | Joined Output |
|---|---|---|---|---|
| 1 | Read first batch from stream1 | ["apple", "banana"] | N/A | N/A |
| 2 | Read first batch from stream2 | ["banana", "cherry"] | N/A | N/A |
| 3 | Apply join on stream1.value == stream2.value | stream1: ["apple", "banana"]; stream2: ["banana", "cherry"] | value equality | ["banana"] |
| 4 | Output joined data to console | ["banana"] | value equality | ["banana"] |
| 5 | Read second batch from stream1 | ["date"] | N/A | N/A |
| 6 | Read second batch from stream2 | ["date", "apple"] | N/A | N/A |
| 7 | Apply join on new batches | stream1: ["date"]; stream2: ["date", "apple"] | value equality | ["date"] |
| 8 | Output joined data to console | ["date"] | value equality | ["date"] |
| 9 | No more data; streaming continues waiting | N/A | N/A | N/A |
💡 Streaming continues indefinitely; only the first two micro-batches are shown here.
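One caveat: a real Spark stream-stream join buffers each side's past rows in state, so matches can span micro-batches; the table above shows a simplified within-batch view. Under Spark's actual semantics, step 7 would also emit "apple", since "apple" arrived on stream1 in batch 1 and on stream2 in batch 2. A minimal pure-Python sketch of that stateful behavior (`StatefulJoin` is an illustrative class, not a Spark API):

```python
class StatefulJoin:
    """Toy model of an unbounded stream-stream inner equi-join on 'value'."""

    def __init__(self):
        self.left_state = []   # all rows seen so far on stream1
        self.right_state = []  # all rows seen so far on stream2

    def process_batch(self, left_batch, right_batch):
        # New left rows match everything buffered (or arriving now) on the right
        out = [v for v in left_batch if v in self.right_state + right_batch]
        # New right rows match previously buffered left rows
        out += [v for v in right_batch if v in self.left_state]
        self.left_state += left_batch
        self.right_state += right_batch
        return sorted(out)

j = StatefulJoin()
print(j.process_batch(["apple", "banana"], ["banana", "cherry"]))  # ['banana']
print(j.process_batch(["date"], ["date", "apple"]))                # ['apple', 'date']
```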
Variable Tracker
| Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 5 | After Step 6 | After Step 7 | Final |
|---|---|---|---|---|---|---|---|---|
| stream1_data | [] | ["apple", "banana"] | ["apple", "banana"] | ["apple", "banana"] | ["date"] | ["date"] | ["date"] | ["date"] |
| stream2_data | [] | [] | ["banana", "cherry"] | ["banana", "cherry"] | ["banana", "cherry"] | ["date", "apple"] | ["date", "apple"] | ["date", "apple"] |
| joined_output | [] | [] | [] | ["banana"] | ["banana"] | ["banana"] | ["date"] | ["date"] |
Key Moments - 2 Insights
Why does the join output only show matching values from both streams?
Because the join condition requires a value to be present in both streams, only matching values appear in the joined output, as shown in execution table steps 3 and 7.
What happens if one stream has data but the other does not?
No joined output is produced until matching data arrives in both streams; in steps where one stream has new data but the other has no match, no join output is emitted.
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table at step 3. What is the joined output?
A. ["banana"]
B. ["apple"]
C. ["cherry"]
D. []
💡 Hint
Check the 'Joined Output' column at step 3 in the execution table.
At which step does the join output include the value "date"?
A. Step 5
B. Step 3
C. Step 7
D. Step 9
💡 Hint
Look for the value "date" in the 'Joined Output' column of the execution table.
If stream2 never sends "banana", what happens to the joined output at step 3?
A. It outputs ["banana"] anyway
B. It outputs an empty list []
C. It outputs all values from stream1
D. It stops the stream
💡 Hint
Refer to the join condition and output behavior in execution table rows 3 and 4.
Concept Snapshot
Streaming joins combine two live data streams by matching rows based on a condition.
They continuously output joined results as new data arrives.
Use Spark's readStream to read, join() to combine, and writeStream to output.
Only matching data from both streams appears in output.
Streaming joins run indefinitely until stopped.
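Because an indefinitely running join must buffer every row it might still match, its state grows without bound unless old rows are evicted. Spark bounds this state with event-time watermarks (`withWatermark` plus a time-range join condition). The eviction idea can be sketched in plain Python; the timestamps and the 10-second delay threshold here are illustrative assumptions, not Spark defaults:

```python
def evict_expired(state, watermark):
    """Drop buffered rows whose event time is at or before the watermark."""
    return [(value, ts) for (value, ts) in state if ts > watermark]

# Buffered (value, event_time_seconds) rows on one side of the join
state = [("apple", 5), ("banana", 12), ("date", 20)]

# Watermark = max event time seen (20) minus a 10-second delay threshold
watermark = 20 - 10
print(evict_expired(state, watermark))  # [('banana', 12), ('date', 20)]
```

Rows older than the watermark can no longer match future input, so dropping them keeps state bounded without changing the join's results.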
Full Transcript
Streaming joins in Apache Spark combine two continuous data streams based on a join condition, such as matching values. The process starts by reading streaming data from two sources. Then, a join condition is defined, for example, matching the 'value' field in both streams. The join is applied continuously as new data arrives. Joined results are output continuously, for example, to the console. The execution table shows step-by-step how batches from each stream are read, joined, and output. Variables track the data in each stream and the joined output after each step. Key moments clarify why only matching values appear in the output and what happens if one stream lacks matching data. The visual quiz tests understanding of the joined output at specific steps and the effect of missing data. The snapshot summarizes the key points: streaming joins combine live data streams, output matching rows continuously, and require both streams to have matching data to produce output.