0
0
Apache Sparkdata~10 mins

Watermarking for late data in Apache Spark - Step-by-Step Execution

Choose your learning style9 modes available
Concept Flow - Watermarking for late data
Start Streaming Data
Assign Event Time
Set Watermark Threshold
Process Data with Window
Drop Late Data beyond Watermark
Output Cleaned Stream
End
Data flows in with event times; watermark sets a threshold to drop data arriving too late, ensuring timely processing.
Execution Sample
Apache Spark
from pyspark.sql.functions import window

stream.withWatermark("eventTime", "10 minutes")
  .groupBy(window("eventTime", "5 minutes"))
  .count()
This code sets a watermark of 10 minutes on eventTime and counts events in 5-minute windows, dropping data later than 10 minutes.
Execution Table
StepInput Event TimeWatermark ThresholdIs Late?ActionOutput
12024-06-01 10:00:00None (initial)NoAccept eventCount updated for window 10:00-10:05
22024-06-01 10:04:00None (initial)NoAccept eventCount updated for window 10:00-10:05
32024-06-01 09:50:00None (initial)NoAccept eventCount updated for window 09:50-09:55
42024-06-01 10:15:0010:05:00 (eventTime - 10 min)NoAccept eventCount updated for window 10:15-10:20
52024-06-01 09:52:0010:05:00YesDrop event (too late)No change
62024-06-01 10:07:0010:05:00NoAccept eventCount updated for window 10:05-10:10
72024-06-01 09:54:0010:05:00YesDrop event (too late)No change
82024-06-01 10:12:0010:05:00NoAccept eventCount updated for window 10:10-10:15
92024-06-01 10:20:0010:05:00NoAccept eventCount updated for window 10:20-10:25
102024-06-01 09:59:0010:10:00YesDrop event (too late)No change
ExitN/AN/AN/AStream ends or watermark advances beyond all event timesFinal counts for windows
💡 Watermark advances with eventTime minus 10 minutes; events older than watermark are dropped as late.
Variable Tracker
VariableStartAfter Step 4After Step 6After Step 9Final
watermarkNone2024-06-01 10:05:002024-06-01 10:05:002024-06-01 10:10:002024-06-01 10:10:00
accepted_events_count{}{"10:00-10:05":2, "09:50-09:55":1, "10:15-10:20":1}{"10:00-10:05":2, "09:50-09:55":1, "10:05-10:10":1, "10:15-10:20":1}{"10:00-10:05":2, "09:50-09:55":1, "10:05-10:10":1, "10:10-10:15":1, "10:15-10:20":1, "10:20-10:25":1}{"10:00-10:05":2, "09:50-09:55":1, "10:05-10:10":1, "10:10-10:15":1, "10:15-10:20":1, "10:20-10:25":1}
dropped_events_count00123
Key Moments - 2 Insights
Why are some events with earlier timestamps dropped even though they arrive later?
Because the watermark moves forward with the event time minus the allowed delay (10 minutes). Events older than the watermark are considered too late and dropped, as shown in steps 5, 7, and 10 in the execution_table.
Does the watermark move backward if an older event arrives late?
No, the watermark only moves forward or stays the same. Late events do not move the watermark backward, so they get dropped if older than the watermark, as seen in steps 5 and 7.
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table at step 5. Why is the event with time 09:52:00 dropped?
ABecause it arrived before the watermark was set
BBecause its event time is older than the watermark 10:05:00
CBecause it belongs to a future window
DBecause the count for that window is already full
💡 Hint
Check the 'Is Late?' and 'Watermark Threshold' columns at step 5 in the execution_table.
At which step does the watermark first get set to 10:05:00?
AStep 1
BStep 3
CStep 4
DStep 6
💡 Hint
Look at the 'Watermark Threshold' column in the execution_table rows.
If the watermark delay was changed from 10 minutes to 5 minutes, what would happen to the event at 09:54:00 arriving at step 7?
AIt would be dropped because the watermark is earlier
BIt would be accepted because the watermark is later
CIt would update the watermark backward
DIt would be counted twice
💡 Hint
Consider how watermark delay affects the threshold for dropping late data, referencing step 7 in the execution_table.
Concept Snapshot
Watermarking for late data in streaming:
- Assign event time to data
- Set watermark delay (e.g., 10 min)
- Watermark = max event time seen - delay
- Drop events older than watermark
- Ensures timely, bounded-lateness processing
Full Transcript
Watermarking helps handle late data in streaming by setting a threshold time. Events with timestamps older than this threshold are dropped to keep processing timely. The watermark moves forward as new data arrives, never backward. This example shows events accepted or dropped based on their event time compared to the watermark, with counts updated for time windows. Late events beyond the watermark are ignored to avoid waiting indefinitely.