Apache Spark · ~30 mins

Streaming joins in Apache Spark - Mini Project: Build & Apply

Streaming Joins with Apache Spark
📖 Scenario: You work at a company that processes live data streams from two sources: user clicks and user profiles. You want to join these streams to enrich click data with user information in real time.
🎯 Goal: Build a Spark Structured Streaming application that reads two streaming DataFrames, performs a join on user ID, and outputs the enriched click data.
📋 What You'll Learn
Create streaming DataFrames for clicks and user profiles
Set a watermark on the event-time column to handle late data
Perform an inner join on user ID between the two streams
Write the joined stream to the console sink
💡 Why This Matters
🌍 Real World
Streaming joins are used in real-time analytics to combine live data from multiple sources, such as user activity and profile information, to provide enriched insights instantly.
💼 Career
Data engineers and data scientists use streaming joins to build pipelines that process and analyze live data for monitoring, personalization, and alerting systems.
1
Create streaming DataFrames for clicks and user profiles
Create two streaming DataFrames called clicks and profiles by reading JSON files from the directories "/path/to/clicks" and "/path/to/profiles" respectively. Use spark.readStream with an explicit schema for each source. The clicks schema has fields userId (string) and clickTime (timestamp); the profiles schema has fields userId (string) and userName (string).
Apache Spark
Need a hint?

Use spark.readStream.schema(...).json(path) to create streaming DataFrames.

2
Set a watermark on the event-time column
Add a watermark to handle late data: set a 10-minute watermark on clicks using the clickTime column, and assign the result back to clicks. Note that withWatermark only accepts timestamp columns, so it cannot be applied to profiles, whose schema (userId, userName) has no event-time field; in a production pipeline the profile stream would typically carry its own timestamp column that you would watermark as well. For an inner join a watermark is optional, but without one Spark must retain join state indefinitely.
Need a hint?

Use clicks.withWatermark("clickTime", "10 minutes"). Watermarks can only be set on timestamp columns.

3
Perform inner join on userId
Use an inner join to combine clicks and profiles on the userId column. Assign the joined DataFrame to a variable called joined_stream.
Need a hint?

Use clicks.join(profiles, on="userId", how="inner") to join the streams.

4
Write the joined stream to console
Write the joined_stream to the console sink using writeStream. Set the output mode to append and start the query. Assign the query to a variable called query. Finally, print "Streaming join started".
Need a hint?

Use writeStream.format("console").outputMode("append").start() and assign to query.