Challenge - 5 Problems
Watermarking Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of Watermarking with Late Data
Given the following Spark Structured Streaming code snippet, what will be the output count of events after applying watermarking and window aggregation?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, window

spark = SparkSession.builder.appName('WatermarkExample').getOrCreate()

# Sample streaming data simulating event time and value (list order = arrival order)
input_data = [
    ("2024-01-01 10:00:00", 1),
    ("2024-01-01 10:01:00", 2),
    ("2024-01-01 10:05:00", 3),
    ("2024-01-01 10:02:00", 4)   # Late event (arrives after 10:05)
]
schema = "event_time STRING, value INT"

# Create DataFrame
streaming_df = spark.createDataFrame(input_data, schema=schema)

# Convert event_time to timestamp
streaming_df = streaming_df.withColumn('event_time', to_timestamp(col('event_time')))

# Apply watermark and window aggregation
result_df = streaming_df.withWatermark('event_time', '2 minutes') \
    .groupBy(window(col('event_time'), '5 minutes')) \
    .count()

result_df.show(truncate=False)
Attempts: 2 left
💡 Hint
Remember that the watermark drops events whose event time is more than 2 minutes behind the maximum event time seen so far.
✗ Incorrect
The watermark is the maximum event time seen so far minus 2 minutes. Once the 10:05:00 event arrives, the watermark is 10:03:00, so the out-of-order event at 10:02:00 falls behind it and is dropped; the remaining 3 events are counted.
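The intended answer of 3 can be checked by hand. Below is a plain-Python simulation of the watermark rule, not Spark itself: it assumes list order is arrival order and that the watermark advances per record, whereas real Spark advances it per micro-batch.

```python
from datetime import datetime, timedelta

# Watermark rule (simplified): an event is dropped if its event time is
# older than (max event time seen so far) - (watermark delay).
events = ["2024-01-01 10:00:00", "2024-01-01 10:01:00",
          "2024-01-01 10:05:00", "2024-01-01 10:02:00"]  # arrival order
delay = timedelta(minutes=2)

max_seen = datetime.min
kept = []
for ts in (datetime.strptime(e, "%Y-%m-%d %H:%M:%S") for e in events):
    max_seen = max(max_seen, ts)
    if ts >= max_seen - delay:   # 10:02 arrives when the watermark is 10:03 -> dropped
        kept.append(ts)

print(len(kept))  # 3 events survive the watermark
```

The key step is that the 10:05 event pushes the watermark to 10:03 before the 10:02 event arrives.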
🧠 Conceptual
Intermediate · 1:30 remaining
Understanding Watermarking Purpose
What is the main purpose of watermarking in Apache Spark Structured Streaming?
Attempts: 2 left
💡 Hint
Think about how Spark handles late data and state management.
✗ Incorrect
Watermarking helps Spark drop very late data to avoid unbounded state growth and keep streaming queries efficient.
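That purpose, bounding state, can be illustrated with a toy eviction loop in plain Python (not Spark; the per-record watermark advancement and 5-minute bucketing are simplifications of Spark's micro-batch bookkeeping):

```python
from datetime import datetime, timedelta

# Toy illustration: each 5-minute window's partial count stays in memory
# only until the watermark passes the window's end, then it is finalized
# and evicted, keeping state bounded.
def window_start(ts, minutes=5):
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0)

state = {}          # window_start -> running count (in-memory state)
finalized = {}      # windows emitted and evicted once the watermark passed them
delay = timedelta(minutes=2)
max_seen = datetime.min

for ts_str in ["2024-01-01 10:00:00", "2024-01-01 10:04:00",
               "2024-01-01 10:09:00", "2024-01-01 10:14:00"]:
    ts = datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S")
    max_seen = max(max_seen, ts)
    watermark = max_seen - delay
    w = window_start(ts)
    state[w] = state.get(w, 0) + 1
    # Evict any window whose end is behind the watermark.
    for start in [s for s in state if s + timedelta(minutes=5) <= watermark]:
        finalized[start] = state.pop(start)

print(len(state), len(finalized))  # only 1 window still open, 2 evicted
```

Without the eviction step, `state` would grow forever as event time advances, which is exactly the unbounded-state problem watermarking prevents.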
🔧 Debug
Advanced · 1:30 remaining
Identify the Error in Watermark Usage
What is wrong with how watermarking is used in this Spark code?
Apache Spark
df.withWatermark('event_time', '5 minutes') \
    .groupBy('event_time') \
    .count()
Attempts: 2 left
💡 Hint
Check whether the watermarked event-time column appears in the aggregation, and think about what each distinct timestamp becomes when you group on it directly.
✗ Incorrect
Contrary to a common assumption, this code does not fail outright: grouping on the watermarked event_time column is accepted, and each distinct timestamp becomes its own group, finalized once the watermark passes it. The AnalysisException this pattern is usually confused with occurs when the watermarked column is left out of the aggregation, e.g. watermarking event_time but grouping by a different column in append output mode. Windowing the event-time column with window() is the intended usage, since it buckets nearby events instead of producing one group per distinct timestamp.
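To see why window() is preferred over grouping on raw timestamps, here is a plain-Python sketch (not Spark; the timestamps and the 5-minute bucketing are illustrative): grouping by the raw value yields one group per distinct timestamp, while bucketing collapses them into a single window group.

```python
from collections import Counter

# Four distinct raw timestamps vs. one 5-minute bucket.
stamps = ["10:00:12", "10:00:47", "10:01:03", "10:04:59"]
by_raw = Counter(stamps)  # one group per distinct timestamp
# Bucket each timestamp by flooring its minute to a multiple of 5.
by_window = Counter(s[:2] + ":" + f"{int(s[3:5]) - int(s[3:5]) % 5:02d}" for s in stamps)
print(len(by_raw), len(by_window))  # 4 raw groups vs 1 window group
```

With high-resolution timestamps, per-timestamp groups are almost all singletons, so the aggregation carries far more state and emits far more rows than a windowed one.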
❓ Data Output
Advanced · 2:00 remaining
Resulting Rows After Watermark and Window
Given this streaming data and watermark, how many rows will the aggregation output produce?
Apache Spark
input_data = [
    ("2024-06-01 12:00:00", 10),
    ("2024-06-01 12:01:00", 20),
    ("2024-06-01 12:06:00", 30),
    ("2024-06-01 12:02:00", 40),  # Late event (arrives after 12:06)
    ("2024-06-01 12:07:00", 50)
]
# Watermark set to 3 minutes
# Window duration 5 minutes
# Events behind the watermark threshold are dropped
# How many window groups will be output?
Attempts: 2 left
💡 Hint
Consider which events fall into which 5-minute windows and which are dropped by watermark.
✗ Incorrect
After the 12:06 event arrives, the watermark is 12:06 minus 3 minutes = 12:03, so the out-of-order event at 12:02 falls behind it and is dropped; the 12:06 and 12:07 events advance the watermark rather than being dropped. The surviving events fill two 5-minute windows, 12:00-12:05 (12:00, 12:01) and 12:05-12:10 (12:06, 12:07), so 2 window groups are output.
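The answer of 2 windows can be verified with the same hand simulation as before (plain Python, not Spark; assumes list order is arrival order and per-record watermark advancement):

```python
from datetime import datetime, timedelta
from collections import Counter

events = ["2024-06-01 12:00:00", "2024-06-01 12:01:00",
          "2024-06-01 12:06:00", "2024-06-01 12:02:00",
          "2024-06-01 12:07:00"]  # arrival order
delay = timedelta(minutes=3)

max_seen = datetime.min
windows = Counter()
for ts in (datetime.strptime(e, "%Y-%m-%d %H:%M:%S") for e in events):
    max_seen = max(max_seen, ts)
    if ts < max_seen - delay:    # 12:02 arrives when the watermark is 12:03
        continue                 # -> dropped as too late
    # Assign the event to its 5-minute tumbling window.
    start = ts.replace(minute=ts.minute - ts.minute % 5, second=0)
    windows[start] += 1

print(len(windows))  # 2 windows: 12:00-12:05 and 12:05-12:10
```

Each surviving window ends up with two events: (12:00, 12:01) and (12:06, 12:07).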
🚀 Application
Expert · 2:30 remaining
Choosing Watermark Duration for Real-Time Analytics
You are designing a real-time dashboard that shows user activity aggregated in 1-minute windows. Users can have network delays causing events to arrive up to 10 minutes late. Which watermark duration is best to balance accuracy and resource use?
Attempts: 2 left
💡 Hint
Think about trade-offs between waiting for late data and keeping system responsive.
✗ Incorrect
A 5-minute watermark balances capturing most late events against state size and latency. A 10-minute watermark maximizes completeness but holds far more state and delays final results; a 1-minute watermark drops many late events; no watermark at all risks unbounded state growth.
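The state-size side of this trade-off can be put in rough numbers. Assuming state scales with the number of 1-minute windows held open per key, approximately watermark_delay / window_size + 1 (a simplification of Spark's actual bookkeeping, used here only for comparison):

```python
# Rough sketch: open windows per key ~= watermark delay / window size + 1.
# The delays and the linear-state assumption are illustrative, not measured.
window_min = 1
open_windows = {delay: delay // window_min + 1 for delay in (1, 5, 10)}
print(open_windows)  # a 10-minute watermark holds ~11 one-minute windows open
```

Holding roughly 11 windows open versus 2 shows why the 10-minute setting costs noticeably more memory and latency, and why the middle setting is the usual compromise.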