Practice - 5 Tasks
Answer the questions below.
Task 1 (fill in the blank)
Easy: Complete the code to perform an inner join on streaming DataFrames.
Apache Spark
joined_stream = stream_df1.join(stream_df2, on=[1], how='inner')
💡 Hint (Common Mistakes):
- Using a column not present in both DataFrames causes errors.
- Using a non-unique column can cause unexpected duplicates.

Explanation: The join key is usually a common column like 'id' to match records between streams.
Task 2 (fill in the blank)
Medium: Complete the code to specify a watermark on the streaming DataFrame.
Apache Spark
stream_df = stream_df.withWatermark('[1]', '10 minutes')
💡 Hint (Common Mistakes):
- Applying the watermark on processing time instead of event time.
- Using a column that is not a timestamp type.

Explanation: Watermarking is applied on event-time columns to handle late data.
Task 3 (fill in the blank)
Hard: Fix the error in the join condition for streaming DataFrames.
Apache Spark
joined_stream = stream_df1.join(stream_df2, on=stream_df1.[1] == stream_df2.id)
💡 Hint (Common Mistakes):
- Using different columns that do not match in the join condition.
- Using a column that does not exist in stream_df1.

Explanation: The join condition must compare the same key column from both DataFrames.
Task 4 (fill in the blank)
Hard: Fill both blanks to create a streaming join with watermark and time range condition.
Apache Spark
joined_stream = stream_df1.withWatermark('[1]', '5 minutes') \
    .join(stream_df2.withWatermark('[2]', '5 minutes'), on='id')
💡 Hint (Common Mistakes):
- Using different columns for the watermark in each stream.
- Using processing time instead of event time.

Explanation: Watermarking both streams on the event-time column is required for stream-stream joins.
Task 5 (fill in the blank)
Hard: Fill all three blanks to filter joined streaming data by time range and select columns.
Apache Spark
result = joined_stream.filter(
        (joined_stream.[1] >= joined_stream.[2] - expr('interval 1 hour')) &
        (joined_stream.[3] <= joined_stream.[2] + expr('interval 1 hour'))
    ) \
    .select('id', 'value', 'timestamp')
💡 Hint (Common Mistakes):
- Mixing up the timestamp and eventTime columns.
- Using processingTime, which is not suitable for event-time filtering.

Explanation: Filtering compares the timestamp column against eventTime within a 1-hour window.