Practice - 5 Tasks
Answer the questions below
1fill in blank
easyComplete the code to set a watermark on the streaming DataFrame to handle late data.
Apache Spark
streamingDF = streamingDF.withWatermark('[1]', '10 minutes')
Drag options to blanks, or click blank then click option'
Attempts:
3 left
💡 Hint
Common Mistakes
Using a column name that does not exist in the DataFrame.
Using a non-timestamp column for watermarking.
✗ Incorrect
The 'timestamp' column is commonly used as the event time column for watermarking in Spark Structured Streaming.
2fill in blank
mediumComplete the code to drop late data that is older than the watermark delay.
Apache Spark
cleanedDF = streamingDF.dropDuplicates(['userId', '[1]'])
Drag options to blanks, or click blank then click option'
Attempts:
3 left
💡 Hint
Common Mistakes
Using a non-unique column for dropping duplicates.
Using a column different from the watermark timestamp.
✗ Incorrect
Dropping duplicates based on 'userId' and 'timestamp' helps remove late data beyond the watermark.
3fill in blank
hardFix the error in the watermarking code by completing the missing argument.
Apache Spark
streamingDF = streamingDF.withWatermark('timestamp', '[1]')
Drag options to blanks, or click blank then click option'
Attempts:
3 left
💡 Hint
Common Mistakes
Using only a number without units.
Using incorrect unit format.
✗ Incorrect
The watermark delay must be a string with a number and a time unit, e.g., '10 minutes'.
4fill in blank
hardFill both blanks to create a streaming aggregation with watermarking and windowing.
Apache Spark
result = streamingDF.withWatermark('[1]', '5 minutes').groupBy(window('[2]', '10 minutes')).count()
Drag options to blanks, or click blank then click option'
Attempts:
3 left
💡 Hint
Common Mistakes
Using different columns for watermark and window.
Using a non-timestamp column.
✗ Incorrect
Both watermark and window functions should use the 'timestamp' column for consistency.
5fill in blank
hardFill all three blanks to filter late data using watermark and event time comparison.
Apache Spark
filteredDF = streamingDF.withWatermark('[1]', '15 minutes').filter(streamingDF.[2] > streamingDF.[3])
Drag options to blanks, or click blank then click option'
Attempts:
3 left
💡 Hint
Common Mistakes
Using inconsistent column names.
Comparing wrong columns in filter.
✗ Incorrect
Watermarking uses 'timestamp' column; filtering compares 'timestamp' to 'watermark' to exclude late data.