Watermarking for late data
📖 Scenario: You work for a company that processes a stream of user click events from a website. Some events arrive late because of network delays. To handle this, you want to use watermarking so that moderately late events are still counted, while events that arrive too late are dropped and your analysis stays accurate.
🎯 Goal: Build a Spark Structured Streaming job that reads click events, applies watermarking on event time to handle late data, and counts clicks per user in a time window.
📋 What You'll Learn
Create a streaming DataFrame with sample click data including event timestamps
Set a watermark on the event time column with a delay threshold
Group data by user and time window to count clicks
Output the aggregated counts
💡 Why This Matters
🌍 Real World
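Watermarking lets a streaming system bound how long it waits for late events. Events that arrive within the delay threshold are still counted, while events that arrive later can be dropped, so old windows can be finalized and their state cleaned up. This keeps real-time analytics both accurate and resource-efficient.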
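To make the idea concrete, here is a small plain-Python simulation of the rule. It is a simplified model, not Spark's implementation: the function name `count_clicks_with_watermark` is hypothetical, times are plain seconds, and the watermark is updated after every event, whereas Spark advances it between micro-batches and only guarantees that events within the threshold are kept.

```python
from collections import defaultdict


def count_clicks_with_watermark(events, delay_s, window_s):
    """Count clicks per (user, window start), dropping events that are too late.

    events: iterable of (user_id, event_time_s) in arrival order.
    The watermark is the largest event time seen so far minus delay_s.
    An event is dropped when its window ended at or before the watermark.
    """
    max_event_time = None
    counts = defaultdict(int)
    dropped = []
    for user_id, event_time in events:
        window_start = (event_time // window_s) * window_s
        watermark = None if max_event_time is None else max_event_time - delay_s
        if watermark is not None and window_start + window_s <= watermark:
            dropped.append((user_id, event_time))  # window already finalized
        else:
            counts[(user_id, window_start)] += 1
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
    return dict(counts), dropped


# Arrival order differs from event-time order for two of these events.
events = [
    ("alice", 0),
    ("bob", 100),
    ("alice", 800),
    ("alice", 200),   # late, but its window [0, 300) is still open
    ("bob", 1300),
    ("bob", 250),     # too late: the watermark is now 700
]
counts, dropped = count_clicks_with_watermark(events, delay_s=600, window_s=300)
print(counts)
print(dropped)
```

In this run the late event `("alice", 200)` is still counted because the watermark is only at 200 seconds when it arrives, while `("bob", 250)` is dropped because the watermark has advanced to 700 seconds, past the end of its window.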
💼 Career
Data engineers and data scientists use watermarking in Spark Structured Streaming to manage late-arriving data in event-time processing pipelines.