Consider two streaming DataFrames df1 and df2 with event-time columns eventTime1 and eventTime2, each with a 10-minute watermark. They are joined on a key column with the condition that eventTime2 falls between eventTime1 and eventTime1 + 5 minutes. What is the output count after processing the following batches?
Batch 1:
df1: (key=1, eventTime1=10:00), (key=2, eventTime1=10:05)
df2: (key=1, eventTime2=10:03), (key=2, eventTime2=10:20)
Batch 2:
df1: (key=1, eventTime1=10:15)
df2: (key=1, eventTime2=10:10)
from pyspark.sql.functions import expr

# Assume df1 and df2 are streaming DataFrames with schemas
# (key int, eventTime1 timestamp) and (key int, eventTime2 timestamp).
df1_with_watermark = df1.withWatermark('eventTime1', '10 minutes')
df2_with_watermark = df2.withWatermark('eventTime2', '10 minutes')

joined = df1_with_watermark.alias("df1").join(
    df2_with_watermark.alias("df2"),
    expr("df1.key = df2.key AND "
         "df2.eventTime2 BETWEEN df1.eventTime1 "
         "AND df1.eventTime1 + interval 5 minutes"),
    'inner'
)

# A streaming DataFrame does not support count() directly; materialize the
# join through a sink (e.g. the memory sink) and count the collected rows.
query = (joined.writeStream
         .format('memory')
         .queryName('joined_results')
         .outputMode('append')
         .start())
# After the batches are processed:
# result_count = spark.sql('SELECT COUNT(*) FROM joined_results').collect()[0][0]
Remember that watermarks drop late data and the join condition requires event times within 5 minutes.
In batch 1, only key=1 matches: 10:03 falls inside [10:00, 10:05]. Key=2 does not match because 10:20 falls outside [10:05, 10:10]. In batch 2, the df2 row at 10:10 matches neither the new df1 row at 10:15 (10:10 precedes the [10:15, 10:20] window) nor the buffered df1 row at 10:00 (10:10 is past the [10:00, 10:05] window), and the new df1 row at 10:15 matches no buffered df2 row either. So the total match count is 1.
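The reasoning above can be checked with a plain-Python simulation of the interval condition (illustrative helpers, not Spark API; watermark-based dropping is ignored here because no row falls behind the watermark in these batches):

```python
from datetime import datetime, timedelta

def t(hhmm):
    # Parse "HH:MM" into a datetime on an arbitrary fixed date.
    return datetime(2024, 1, 1, *map(int, hhmm.split(":")))

# All rows from both batches: (key, event time)
df1_rows = [(1, t("10:00")), (2, t("10:05")), (1, t("10:15"))]
df2_rows = [(1, t("10:03")), (2, t("10:20")), (1, t("10:10"))]

# Join condition: same key, and eventTime2 BETWEEN eventTime1
# AND eventTime1 + interval 5 minutes.
matches = [
    (k1, e1, e2)
    for k1, e1 in df1_rows
    for k2, e2 in df2_rows
    if k1 == k2 and e1 <= e2 <= e1 + timedelta(minutes=5)
]

print(len(matches))  # 1 -- only (key=1, 10:00) pairs with (key=1, 10:03)
```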
You have two streaming DataFrames orders and customers. You perform a left outer join on customerId. The customers stream has a watermark of 1 hour on updateTime. If a late customer update arrives after the watermark, will it appear in the join output?
Given the following data:
Orders batch:
(orderId=101, customerId=1, orderTime=12:00)
(orderId=102, customerId=2, orderTime=12:05)
Customers batch:
(customerId=1, updateTime=11:00, name='Alice')
(customerId=2, updateTime=10:00, name='Bob')
The watermark drops rows whose event time falls behind the current watermark threshold (the maximum event time seen minus the configured delay).
The watermark threshold is the maximum event time observed on the stream minus the configured delay, not the wall-clock time. Assuming customer updates with event times near 12:00 have already been seen, the threshold sits near 11:00, so the customerId=2 row with updateTime=10:00 is behind the watermark and is dropped without joining. The customerId=1 row at updateTime=11:00 is at or ahead of the threshold, so it is kept and joins successfully.
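The drop decision can be sketched in plain Python (not Spark API; the 12:00 maximum observed event time is an assumption for illustration): a row is late when its event time falls behind the maximum event time seen minus the watermark delay.

```python
from datetime import datetime, timedelta

def is_dropped(event_time, max_event_time_seen, delay):
    # A row is considered too late when its event time falls behind
    # the watermark threshold (max event time seen minus the delay).
    return event_time < max_event_time_seen - delay

max_seen = datetime(2024, 1, 1, 12, 0)   # assume updates up to 12:00 observed
delay = timedelta(hours=1)               # withWatermark('updateTime', '1 hour')

alice = datetime(2024, 1, 1, 11, 0)      # customerId=1, updateTime=11:00
bob = datetime(2024, 1, 1, 10, 0)        # customerId=2, updateTime=10:00

print(is_dropped(alice, max_seen, delay))  # False -- kept, joins
print(is_dropped(bob, max_seen, delay))    # True  -- dropped as late
```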
Given the following Spark Structured Streaming code snippet, what problem does it have?
joined = df1.join(df2, on='key', how='inner')
query = joined.writeStream.format('console').start()
query.awaitTermination()

Assume both df1 and df2 are streaming DataFrames without watermarks.
Watermarks let Spark bound and clean up join state; without them, buffered rows are kept indefinitely.
Spark requires watermarks on both streams, plus an event-time join condition, for stream-stream outer joins; without them it raises an AnalysisException when the query starts. For an inner join like this one, watermarks are technically optional: the query runs, but Spark must buffer every input row on both sides indefinitely, so state grows without bound. Either way, watermarks on both streams are what allow Spark to manage and clean up join state.
You run two streaming joins: one with watermarks on both streams and one without watermarks. Which graph best represents the state size over time?
Watermarks help Spark remove old state data.
Without watermarks, Spark cannot decide when a buffered row can no longer match future input, so it keeps everything and state size grows without bound, roughly linearly with input volume. With watermarks, rows older than the watermark are evicted, so state size stays roughly flat.
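One way to observe this in practice is to track the join operator's buffered-row count across batches from the query's progress reports (query.lastProgress in PySpark). The helper below just reads the stateOperators section of one progress dictionary; the sample numbers are invented for illustration, and the exact set of fields varies by Spark version:

```python
def total_state_rows(progress):
    # Sum buffered state rows across all stateful operators reported
    # in a StreamingQueryProgress dictionary.
    return sum(op["numRowsTotal"] for op in progress.get("stateOperators", []))

# Simplified shape of a progress report, e.g. from query.lastProgress:
sample_progress = {
    "batchId": 42,
    "stateOperators": [
        {"numRowsTotal": 1200, "numRowsUpdated": 30},
    ],
}

print(total_state_rows(sample_progress))  # 1200
```

Plotting numRowsTotal per batch shows the two curves from the question: an ever-climbing line without watermarks, and a plateau with them.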
You need to join two streams clicks and impressions on userId with event-time columns clickTime and impressionTime. You want to join clicks with impressions that happened within 10 minutes before or after the click. Late data can arrive up to 30 minutes late. Which approach correctly implements this with Spark Structured Streaming?
Watermarks must cover maximum allowed lateness. Join condition must cover the time window.
To handle data arriving up to 30 minutes late, set a 30-minute watermark on both clickTime and impressionTime. The join condition then restricts impressionTime to the range clickTime - 10 minutes through clickTime + 10 minutes. Together these produce the correct matches while letting Spark evict buffered rows once the watermark guarantees they can no longer join.
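A sketch of this approach, assuming streaming DataFrames `clicks` and `impressions` with the columns described above (the Spark-side calls are shown in comments; the runnable part is a plain-Python equivalent of the BETWEEN clause):

```python
from datetime import datetime, timedelta

# Spark-side setup (sketch):
#   clicks_wm = clicks.withWatermark("clickTime", "30 minutes")
#   imps_wm = impressions.withWatermark("impressionTime", "30 minutes")
#   joined = clicks_wm.join(imps_wm, expr(JOIN_CONDITION), "inner")
JOIN_CONDITION = (
    "clicks.userId = impressions.userId AND "
    "impressionTime BETWEEN clickTime - interval 10 minutes "
    "AND clickTime + interval 10 minutes"
)

def within_window(click_time, impression_time, window=timedelta(minutes=10)):
    # Plain-Python equivalent of the BETWEEN clause above: the impression
    # must fall within +/- 10 minutes of the click.
    return abs(impression_time - click_time) <= window

click = datetime(2024, 1, 1, 12, 0)
print(within_window(click, datetime(2024, 1, 1, 12, 7)))   # True
print(within_window(click, datetime(2024, 1, 1, 11, 55)))  # True
print(within_window(click, datetime(2024, 1, 1, 12, 15)))  # False
```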