Apache Spark · Data · ~10 mins

Streaming joins in Apache Spark - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
Task 1: Fill in the blank (easy)

Complete the code to perform an inner join on streaming DataFrames.

Apache Spark
joined_stream = stream_df1.join(stream_df2, on=[1], how='inner')
A. 'value'
B. 'timestamp'
C. 'id'
D. 'name'
Common Mistakes
Using a column not present in both DataFrames causes errors.
Using a non-unique column can cause unexpected duplicates.
Task 2: Fill in the blank (medium)

Complete the code to specify a watermark on the streaming DataFrame.

Apache Spark
stream_df = stream_df.withWatermark('[1]', '10 minutes')
A. eventTime
B. processingTime
C. timestamp
D. date
Common Mistakes
Applying watermark on processing time instead of event time.
Using a column that is not a timestamp type.
Task 3: Fill in the blank (hard)

Fix the error in the join condition for streaming DataFrames.

Apache Spark
joined_stream = stream_df1.join(stream_df2, on=stream_df1.[1] == stream_df2.id)
A. timestamp
B. value
C. name
D. id
Common Mistakes
Using join columns whose values do not actually correspond across the two streams.
Using a column that does not exist in stream_df1.
Task 4: Fill in the blank (hard)

Fill both blanks to apply matching watermarks to both streams before joining them.

Apache Spark
joined_stream = stream_df1.withWatermark('[1]', '5 minutes') \
    .join(stream_df2.withWatermark('[2]', '5 minutes'), on='id')
A. eventTime
B. processingTime
C. timestamp
D. date
Common Mistakes
Using different columns for watermark in each stream.
Using processing time instead of event time.
Task 5: Fill in the blank (hard)

Fill all three blanks to filter joined streaming data by time range and select columns.

Apache Spark
result = joined_stream.filter(
    (joined_stream.[1] >= joined_stream.[2] - expr('interval 1 hour')) &
    (joined_stream.[3] <= joined_stream.[2] + expr('interval 1 hour'))
).select('id', 'value', 'timestamp')
A. timestamp
B. eventTime
C. processingTime
D. time
Common Mistakes
Mixing up timestamp and eventTime columns.
Using processingTime which is not suitable for event-time filtering.