Apache Spark · ~30 mins

Streaming joins in Apache Spark - Mini Project: Build & Apply

Streaming Joins with Apache Spark
📖 Scenario: You work at a company that processes live data streams from two sources: user clicks and user profiles. You want to join these streams to enrich click data with user information in real time.
🎯 Goal: Build a Spark Structured Streaming application that reads two streaming DataFrames, performs a join on user ID, and outputs the enriched click data.
📋 What You'll Learn
Create streaming DataFrames for clicks and user profiles
Set a watermark on the event-time column to handle late data
Perform an inner join on user ID between the two streams
Write the joined stream to the console sink
💡 Why This Matters
🌍 Real World
Streaming joins are used in real-time analytics to combine live data from multiple sources, such as user activity and profile information, to provide enriched insights instantly.
💼 Career
Data engineers and data scientists use streaming joins to build pipelines that process and analyze live data for monitoring, personalization, and alerting systems.
1
Create streaming DataFrames for clicks and user profiles
Create two streaming DataFrames called clicks and profiles by reading JSON files from the directories "/path/to/clicks" and "/path/to/profiles" respectively. Use spark.readStream with an explicit schema for each source. The clicks schema has fields userId (string) and clickTime (timestamp); the profiles schema has fields userId (string) and userName (string).
Apache Spark
Need a hint?

Use spark.readStream.schema(...).json(path) to create streaming DataFrames.

2
Set a watermark on the event-time column
Add a watermark to handle late data: set a 10-minute watermark on clicks using the clickTime column, and assign the result back to clicks. Note that withWatermark only accepts timestamp columns, so it cannot be applied to profiles, whose schema (userId, userName) has no event-time field; in a production pipeline the profile stream would typically carry its own timestamp column that you would watermark as well. For an inner join a watermark is optional, but without one Spark must retain join state indefinitely.
Need a hint?

Use clicks.withWatermark("clickTime", "10 minutes"). Watermarks can only be set on timestamp columns.

3
Perform inner join on userId
Use an inner join to combine clicks and profiles on the userId column. Assign the joined DataFrame to a variable called joined_stream.
Need a hint?

Use clicks.join(profiles, on="userId", how="inner") to join the streams.

4
Write the joined stream to console
Write the joined_stream to the console sink using writeStream. Set the output mode to append and start the query. Assign the query to a variable called query. Finally, print "Streaming join started".
Need a hint?

Use writeStream.format("console").outputMode("append").start() and assign to query.