Apache Spark · ~20 mins

Watermarking for late data in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️
Watermarking Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
Predict Output
intermediate
2:00 remaining
Output of Watermarking with Late Data
Given the following Spark Structured Streaming code snippet, what will be the output count of events after applying watermarking and window aggregation?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName('WatermarkExample').getOrCreate()

# Sample streaming data simulating event time and value
input_data = [
    ("2024-01-01 10:00:00", 1),
    ("2024-01-01 10:01:00", 2),
    ("2024-01-01 10:05:00", 3),  # Late event
    ("2024-01-01 10:02:00", 4)
]

schema = "event_time STRING, value INT"

# Create DataFrame
streaming_df = spark.createDataFrame(input_data, schema=schema)

# Convert event_time to timestamp
from pyspark.sql.functions import col, to_timestamp
streaming_df = streaming_df.withColumn('event_time', to_timestamp(col('event_time')))

# Apply watermark and window aggregation
result_df = streaming_df.withWatermark('event_time', '2 minutes') \
    .groupBy(window(col('event_time'), '5 minutes')) \
    .count()

result_df.show(truncate=False)
A. [{window: [2024-01-01 10:00:00, 10:05:00), count: 4}]
B. []
C. [{window: [2024-01-01 10:00:00, 10:05:00), count: 2}]
D. [{window: [2024-01-01 10:00:00, 10:05:00), count: 3}]
Attempts: 2 left
💡 Hint
Remember that the watermark drops events that arrive more than 2 minutes behind the maximum event time seen so far.
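The rule in this hint can be sketched as a small pure-Python simulation. This is only an illustrative model of the watermark rule, not Spark's actual implementation, and the timestamps below are hypothetical:

```python
from datetime import datetime, timedelta

# Toy model of the watermark rule (illustration only, not Spark itself):
# an event is dropped once its event time falls behind
# (max event time seen so far) - (watermark delay).
DELAY = timedelta(minutes=2)

def simulate(arrivals):
    max_seen = datetime.min
    kept, dropped = [], []
    for ts in arrivals:
        max_seen = max(max_seen, ts)        # the watermark only moves forward
        watermark = max_seen - DELAY
        (kept if ts >= watermark else dropped).append(ts)
    return kept, dropped

# Hypothetical arrival order: the last event is 3 minutes behind the max.
arrivals = [datetime(2024, 1, 1, 9, 0),
            datetime(2024, 1, 1, 9, 4),
            datetime(2024, 1, 1, 9, 1)]
kept, dropped = simulate(arrivals)
```

Here the 09:01 event arrives after the maximum seen time has reached 09:04, so the watermark (09:02) has already passed it and it is dropped.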
🧠 Conceptual
intermediate
1:30 remaining
Understanding Watermarking Purpose
What is the main purpose of watermarking in Apache Spark Structured Streaming?
A. To increase the window size for aggregations automatically
B. To speed up batch processing by caching data in memory
C. To limit the amount of state data by dropping late events beyond a threshold
D. To guarantee exactly-once processing of all events
Attempts: 2 left
💡 Hint
Think about how Spark handles late data and state management.
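One way to see the state-management angle of this hint is a toy (non-Spark) sketch in which per-window counts are evicted once the watermark passes a window's end. All names, durations, and timestamps here are made up for illustration:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # tumbling window length (hypothetical)
DELAY = timedelta(minutes=3)    # watermark delay (hypothetical)

state = {}                      # window start -> running count
max_event_time = datetime.min

def window_start(ts):
    # Floor the timestamp to its enclosing 5-minute window.
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)

def process(ts):
    global max_event_time
    max_event_time = max(max_event_time, ts)
    watermark = max_event_time - DELAY
    start = window_start(ts)
    state[start] = state.get(start, 0) + 1
    # Without this eviction step, state would grow without bound.
    for s in [s for s in state if s + WINDOW <= watermark]:
        del state[s]

process(datetime(2024, 1, 1, 12, 0))
process(datetime(2024, 1, 1, 12, 9))   # watermark reaches 12:06, evicting the 12:00 window
```

After the second event the watermark (12:06) has passed the end of the [12:00, 12:05) window, so its state can be finalized and dropped, which is exactly the resource bound the watermark buys.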
🔧 Debug
advanced
1:30 remaining
Identify the Error in Watermark Usage
What error will this Spark code produce?
Apache Spark
df.withWatermark('event_time', '5 minutes')\
  .groupBy('event_time')\
  .count()
A. AnalysisException: Watermark can only be applied on event time columns used in window aggregation
B. RuntimeException: groupBy requires a window function when watermark is used
C. TypeError: withWatermark expects a timestamp column but got string
D. No error, code runs successfully
Attempts: 2 left
💡 Hint
Watermarking requires window aggregation to work properly.
📊 Data Output
advanced
2:00 remaining
Resulting Rows After Watermark and Window
Given this streaming data and watermark, how many rows will the aggregation output produce?
Apache Spark
input_data = [
    ("2024-06-01 12:00:00", 10),
    ("2024-06-01 12:01:00", 20),
    ("2024-06-01 12:06:00", 30),  # Late event
    ("2024-06-01 12:02:00", 40),
    ("2024-06-01 12:07:00", 50)   # Late event
]

# Watermark set to 3 minutes
# Window duration 5 minutes

# Events beyond the watermark threshold are dropped

# How many window groups will be output?
A. 2
B. 3
C. 1
D. 0
Attempts: 2 left
💡 Hint
Consider which events fall into which 5-minute windows and which are dropped by watermark.
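The bucketing described in this hint can be walked through with a small pure-Python sketch (a toy model, not Spark; the arrivals below are hypothetical and deliberately differ from the question's data):

```python
from datetime import datetime, timedelta

# Toy walk-through: bucket the events that survive a 3-minute watermark
# into 5-minute tumbling windows, counting events per window.
DELAY = timedelta(minutes=3)
WINDOW_MINUTES = 5

def tumbling_start(ts):
    # Floor the timestamp to its enclosing 5-minute window.
    m = ts.minute - ts.minute % WINDOW_MINUTES
    return ts.replace(minute=m, second=0, microsecond=0)

def windows_after_watermark(arrivals):
    max_seen = datetime.min
    counts = {}                              # window start -> event count
    for ts in arrivals:
        max_seen = max(max_seen, ts)
        if ts >= max_seen - DELAY:           # event survives the watermark
            start = tumbling_start(ts)
            counts[start] = counts.get(start, 0) + 1
    return counts

# Hypothetical arrivals: the last event is 4 minutes behind the max seen.
arrivals = [datetime(2024, 6, 1, 12, 0),
            datetime(2024, 6, 1, 12, 6),
            datetime(2024, 6, 1, 12, 2)]     # behind the 12:03 watermark -> dropped
counts = windows_after_watermark(arrivals)
```

In this made-up run the surviving events land in two distinct 5-minute windows, so two window groups would be output; the same tracing technique answers the question above.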
🚀 Application
expert
2:30 remaining
Choosing Watermark Duration for Real-Time Analytics
You are designing a real-time dashboard that shows user activity aggregated in 1-minute windows. Users can have network delays causing events to arrive up to 10 minutes late. Which watermark duration is best to balance accuracy and resource use?
A. Set watermark to 10 minutes to include all late events
B. Set watermark to 5 minutes to balance late data and resource constraints
C. Set watermark to 1 minute to minimize latency and resource use
D. Do not use watermark to ensure no data is dropped
Attempts: 2 left
💡 Hint
Think about the trade-off between waiting for late data and keeping the system responsive.