
Streaming joins in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of a stream-stream inner join with watermarks

Consider two streaming DataFrames df1 and df2 with event-time columns eventTime1 and eventTime2. Both have watermarks set with a 10-minute delay. They are joined on the key column, with the condition that eventTime2 falls within 5 minutes after eventTime1. What is the output count after processing the following batches?

Batch 1:
df1: (key=1, eventTime1=10:00), (key=2, eventTime1=10:05)
df2: (key=1, eventTime2=10:03), (key=2, eventTime2=10:20)

Batch 2:
df1: (key=1, eventTime1=10:15)
df2: (key=1, eventTime2=10:10)

from pyspark.sql.functions import expr

# Assume df1 and df2 are streaming DataFrames:
# df1 has columns (key int, eventTime1 timestamp),
# df2 has columns (key int, eventTime2 timestamp).
# Set watermarks on both streams:
df1_with_watermark = df1.withWatermark('eventTime1', '10 minutes')
df2_with_watermark = df2.withWatermark('eventTime2', '10 minutes')

joined = df1_with_watermark.alias("df1").join(
    df2_with_watermark.alias("df2"),
    expr("df1.key = df2.key AND df2.eventTime2 BETWEEN df1.eventTime1 AND df1.eventTime1 + interval 5 minutes"),
    'inner'
)

# Streaming DataFrames do not support a direct .count(); here "output count"
# means the total number of rows the join emits across both micro-batches.
A. 2
B. 3
C. 4
D. 1
💡 Hint

Remember that watermarks drop late data and the join condition requires event times within 5 minutes.

Data Output (intermediate)
Number of rows after left outer streaming join with late data

You have two streaming DataFrames orders and customers. You perform a left outer join on customerId. The customers stream has a watermark of 1 hour on updateTime. If a late customer update arrives after the watermark, will it appear in the join output?

Given the following data:

Orders batch:
(orderId=101, customerId=1, orderTime=12:00)
(orderId=102, customerId=2, orderTime=12:05)

Customers batch:
(customerId=1, updateTime=11:00, name='Alice')
(customerId=2, updateTime=10:00, name='Bob')

A. Both customers appear in the join output
B. Only customerId=2 appears; customerId=1 is dropped due to the watermark
C. Only customerId=1 appears; customerId=2 is dropped due to the late update
D. No customers appear because the watermark drops everything
💡 Hint

The watermark drops incoming data whose event time is older than the current watermark threshold.

🔧 Debug (advanced)
Identify the error in stream-stream join code

Given the following Spark Structured Streaming code snippet, what error will it raise?

joined = df1.join(df2, on='key', how='inner')
query = joined.writeStream.format('console').start()
query.awaitTermination()

Assume both df1 and df2 are streaming DataFrames without watermarks.

A. No error; the code runs and outputs joined data
B. AnalysisException: stream-stream join without watermark is not supported
C. TimeoutException due to a missing trigger interval
D. NullPointerException due to a missing schema
💡 Hint

Stream-stream joins require watermarks to handle state cleanup.

Visualization (advanced)
Visualize state growth in stream-stream join with and without watermark

You run two streaming joins: one with watermarks on both streams and one without watermarks. Which graph best represents the state size over time?

A. Graph showing state size growing linearly without a watermark, stable with a watermark
B. Graph showing state size stable without a watermark, growing with a watermark
C. Graph showing state size dropping to zero immediately in both cases
D. Graph showing random fluctuations in state size regardless of the watermark
💡 Hint

Watermarks help Spark remove old state data.

🚀 Application (expert)
Design a streaming join to handle late data with event-time tolerance

You need to join two streams clicks and impressions on userId with event-time columns clickTime and impressionTime. You want to join clicks with impressions that happened within 10 minutes before or after the click. Late data can arrive up to 30 minutes late. Which approach correctly implements this with Spark Structured Streaming?

A. Set watermarks on both streams with a 30-minute delay; join on userId with the condition impressionTime BETWEEN clickTime - interval 10 minutes AND clickTime + interval 10 minutes
B. Set a watermark only on clicks with a 10-minute delay; join on userId and impressionTime = clickTime
C. No watermarks; join on userId with impressionTime BETWEEN clickTime - interval 30 minutes AND clickTime + interval 30 minutes
D. Set watermarks on both streams with a 10-minute delay; join on userId and impressionTime = clickTime
💡 Hint

Watermarks must cover the maximum allowed lateness, and the join condition must cover the full time window.