Challenge - 5 Problems
Watermarking Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of Watermarking with Late Data
Given the following Spark Structured Streaming code snippet, what will be the output count of events after applying watermarking and window aggregation?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, window

spark = SparkSession.builder.appName('WatermarkExample').getOrCreate()

# Sample streaming data simulating event time and value (list order = arrival order)
input_data = [
    ("2024-01-01 10:00:00", 1),
    ("2024-01-01 10:01:00", 2),
    ("2024-01-01 10:05:00", 3),
    ("2024-01-01 10:02:00", 4)   # Late event (arrives after 10:05)
]
schema = "event_time STRING, value INT"

# Create DataFrame
streaming_df = spark.createDataFrame(input_data, schema=schema)

# Convert event_time to timestamp
streaming_df = streaming_df.withColumn('event_time', to_timestamp(col('event_time')))

# Apply watermark and window aggregation
result_df = streaming_df.withWatermark('event_time', '2 minutes') \
    .groupBy(window(col('event_time'), '5 minutes')) \
    .count()

result_df.show(truncate=False)
Attempts: 2 left
💡 Hint
Remember that the watermark drops events whose event time is more than 2 minutes behind the maximum event time seen so far.
✗ Incorrect
The watermark is the maximum event time seen so far minus 2 minutes. Once the 10:05:00 event arrives, the watermark is 10:03:00, so the out-of-order event at 10:02:00 falls behind it and is dropped; the remaining 3 events are counted.
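The intended answer of 3 can be checked by hand. Below is a plain-Python simulation of the watermark rule, not Spark itself: it assumes list order is arrival order and that the watermark advances per record, whereas real Spark advances it per micro-batch.

```python
from datetime import datetime, timedelta

# Watermark rule (simplified): an event is dropped if its event time is
# older than (max event time seen so far) - (watermark delay).
events = ["2024-01-01 10:00:00", "2024-01-01 10:01:00",
          "2024-01-01 10:05:00", "2024-01-01 10:02:00"]  # arrival order
delay = timedelta(minutes=2)

max_seen = datetime.min
kept = []
for ts in (datetime.strptime(e, "%Y-%m-%d %H:%M:%S") for e in events):
    max_seen = max(max_seen, ts)
    if ts >= max_seen - delay:   # 10:02 arrives when the watermark is 10:03 -> dropped
        kept.append(ts)

print(len(kept))  # 3 events survive the watermark
```

The key step is that the 10:05 event pushes the watermark to 10:03 before the 10:02 event arrives.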
🧠 Conceptual
Intermediate · 1:30 remaining
Understanding Watermarking Purpose
What is the main purpose of watermarking in Apache Spark Structured Streaming?
Attempts: 2 left
💡 Hint
Think about how Spark handles late data and state management.
✗ Incorrect
Watermarking helps Spark drop very late data to avoid unbounded state growth and keep streaming queries efficient.
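That purpose, bounding state, can be illustrated with a toy eviction loop in plain Python (not Spark; the per-record watermark advancement and 5-minute bucketing are simplifications of Spark's micro-batch bookkeeping):

```python
from datetime import datetime, timedelta

# Toy illustration: each 5-minute window's partial count stays in memory
# only until the watermark passes the window's end, then it is finalized
# and evicted, keeping state bounded.
def window_start(ts, minutes=5):
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0)

state = {}          # window_start -> running count (in-memory state)
finalized = {}      # windows emitted and evicted once the watermark passed them
delay = timedelta(minutes=2)
max_seen = datetime.min

for ts_str in ["2024-01-01 10:00:00", "2024-01-01 10:04:00",
               "2024-01-01 10:09:00", "2024-01-01 10:14:00"]:
    ts = datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S")
    max_seen = max(max_seen, ts)
    watermark = max_seen - delay
    w = window_start(ts)
    state[w] = state.get(w, 0) + 1
    # Evict any window whose end is behind the watermark.
    for start in [s for s in state if s + timedelta(minutes=5) <= watermark]:
        finalized[start] = state.pop(start)

print(len(state), len(finalized))  # only 1 window still open, 2 evicted
```

Without the eviction step, `state` would grow forever as event time advances, which is exactly the unbounded-state problem watermarking prevents.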
🔧 Debug
Advanced · 1:30 remaining
Identify the Error in Watermark Usage
What is wrong with how watermarking is used in this Spark code?
Apache Spark
df.withWatermark('event_time', '5 minutes') \
    .groupBy('event_time') \
    .count()
Attempts: 2 left
💡 Hint
Check whether the watermarked event-time column appears in the aggregation, and think about what each distinct timestamp becomes when you group on it directly.
✗ Incorrect
Contrary to a common assumption, this code does not fail outright: grouping on the watermarked event_time column is accepted, and each distinct timestamp becomes its own group, finalized once the watermark passes it. The AnalysisException this pattern is usually confused with occurs when the watermarked column is left out of the aggregation, e.g. watermarking event_time but grouping by a different column in append output mode. Windowing the event-time column with window() is the intended usage, since it buckets nearby events instead of producing one group per distinct timestamp.
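To see why window() is preferred over grouping on raw timestamps, here is a plain-Python sketch (not Spark; the timestamps and the 5-minute bucketing are illustrative): grouping by the raw value yields one group per distinct timestamp, while bucketing collapses them into a single window group.

```python
from collections import Counter

# Four distinct raw timestamps vs. one 5-minute bucket.
stamps = ["10:00:12", "10:00:47", "10:01:03", "10:04:59"]
by_raw = Counter(stamps)  # one group per distinct timestamp
# Bucket each timestamp by flooring its minute to a multiple of 5.
by_window = Counter(s[:2] + ":" + f"{int(s[3:5]) - int(s[3:5]) % 5:02d}" for s in stamps)
print(len(by_raw), len(by_window))  # 4 raw groups vs 1 window group
```

With high-resolution timestamps, per-timestamp groups are almost all singletons, so the aggregation carries far more state and emits far more rows than a windowed one.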
❓ Data Output
Advanced · 2:00 remaining
Resulting Rows After Watermark and Window
Given this streaming data and watermark, how many rows will the aggregation output produce?
Apache Spark
input_data = [
    ("2024-06-01 12:00:00", 10),
    ("2024-06-01 12:01:00", 20),
    ("2024-06-01 12:06:00", 30),
    ("2024-06-01 12:02:00", 40),  # Late event (arrives after 12:06)
    ("2024-06-01 12:07:00", 50)
]
# Watermark set to 3 minutes
# Window duration 5 minutes
# Events behind the watermark threshold are dropped
# How many window groups will be output?
Attempts: 2 left
💡 Hint
Consider which events fall into which 5-minute windows and which are dropped by watermark.
✗ Incorrect
After the 12:06 event arrives, the watermark is 12:06 minus 3 minutes = 12:03, so the out-of-order event at 12:02 falls behind it and is dropped; the 12:06 and 12:07 events advance the watermark rather than being dropped. The surviving events fill two 5-minute windows, 12:00-12:05 (12:00, 12:01) and 12:05-12:10 (12:06, 12:07), so 2 window groups are output.
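The answer of 2 windows can be verified with the same hand simulation as before (plain Python, not Spark; assumes list order is arrival order and per-record watermark advancement):

```python
from datetime import datetime, timedelta
from collections import Counter

events = ["2024-06-01 12:00:00", "2024-06-01 12:01:00",
          "2024-06-01 12:06:00", "2024-06-01 12:02:00",
          "2024-06-01 12:07:00"]  # arrival order
delay = timedelta(minutes=3)

max_seen = datetime.min
windows = Counter()
for ts in (datetime.strptime(e, "%Y-%m-%d %H:%M:%S") for e in events):
    max_seen = max(max_seen, ts)
    if ts < max_seen - delay:    # 12:02 arrives when the watermark is 12:03
        continue                 # -> dropped as too late
    # Assign the event to its 5-minute tumbling window.
    start = ts.replace(minute=ts.minute - ts.minute % 5, second=0)
    windows[start] += 1

print(len(windows))  # 2 windows: 12:00-12:05 and 12:05-12:10
```

Each surviving window ends up with two events: (12:00, 12:01) and (12:06, 12:07).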
🚀 Application
Expert · 2:30 remaining
Choosing Watermark Duration for Real-Time Analytics
You are designing a real-time dashboard that shows user activity aggregated in 1-minute windows. Users can have network delays causing events to arrive up to 10 minutes late. Which watermark duration is best to balance accuracy and resource use?
Attempts: 2 left
💡 Hint
Think about trade-offs between waiting for late data and keeping system responsive.
✗ Incorrect
A 5-minute watermark balances capturing most late events against state size and latency. A 10-minute watermark maximizes completeness but holds far more state and delays final results; a 1-minute watermark drops many late events; no watermark at all risks unbounded state growth.
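The state-size side of this trade-off can be put in rough numbers. Assuming state scales with the number of 1-minute windows held open per key, approximately watermark_delay / window_size + 1 (a simplification of Spark's actual bookkeeping, used here only for comparison):

```python
# Rough sketch: open windows per key ~= watermark delay / window size + 1.
# The delays and the linear-state assumption are illustrative, not measured.
window_min = 1
open_windows = {delay: delay // window_min + 1 for delay in (1, 5, 10)}
print(open_windows)  # a 10-minute watermark holds ~11 one-minute windows open
```

Holding roughly 11 windows open versus 2 shows why the 10-minute setting costs noticeably more memory and latency, and why the middle setting is the usual compromise.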