Apache Spark · ~30 mins

Watermarking for late data in Apache Spark - Mini Project: Build & Apply

Watermarking for late data
📖 Scenario: You work for a company that processes streaming data of user clicks on a website. Sometimes, data arrives late due to network delays. To handle this, you want to use watermarking to ignore very late data and keep your analysis accurate.
🎯 Goal: Build a Spark Structured Streaming job that reads click events, applies watermarking on event time to handle late data, and counts clicks per user in a time window.
📋 What You'll Learn
Create a streaming DataFrame with sample click data including event timestamps
Set a watermark on the event time column with a delay threshold
Group data by user and time window to count clicks
Output the aggregated counts
💡 Why This Matters
🌍 Real World
Watermarking helps streaming systems ignore very late data that can cause incorrect results, improving data accuracy in real-time analytics.
💼 Career
Data engineers and data scientists use watermarking in Spark Structured Streaming to manage late-arriving data in event-time processing pipelines.
1
Create streaming DataFrame with click data
Create a DataFrame called clicks with columns user (string) and event_time (timestamp), using the rows ("Alice", "2024-06-01 12:00:00"), ("Bob", "2024-06-01 12:01:00"), ("Alice", "2024-06-01 12:02:00"). Use spark.createDataFrame to build it and to_timestamp to convert the string column to timestamps. (spark.createDataFrame produces a static DataFrame that stands in for the stream in this exercise; a production job would read with spark.readStream.)
💡 Hint: Use spark.createDataFrame with a list of tuples and column names, then withColumn with to_timestamp to convert the string column to a timestamp.

2
Set watermark delay threshold
Create a variable called watermark_delay and set it to the string "1 minute". This is the delay threshold: how long Spark keeps accepting late events for a window before finalizing it.
💡 Hint: Create a variable named watermark_delay and assign it the string "1 minute".
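This step is a single line. Spark accepts interval strings such as "10 seconds", "1 minute", or "2 hours" for the watermark delay:

```python
# Delay threshold: events arriving more than 1 minute behind the
# latest event time seen so far may be dropped from aggregations.
watermark_delay = "1 minute"
```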

3
Apply watermark and group by user and window
Use the clicks DataFrame and apply withWatermark on the event_time column with the delay threshold watermark_delay. Then group by user and a 2-minute time window on event_time using window. Aggregate by counting the number of clicks per group and save the result in a DataFrame called click_counts. Import window from pyspark.sql.functions.
💡 Hint: Call withWatermark("event_time", watermark_delay), then group by user and window("event_time", "2 minutes"), and aggregate with count().

4
Show the aggregated click counts
Use click_counts.show() to display the aggregated counts of clicks per user and time window.
💡 Hint: Call click_counts.show() to print the result table.