
Structured Streaming basics in Apache Spark - Mini Project: Build & Apply

📖 Scenario: You work at a company that receives live data about customer orders. You want to process this data as it arrives to get quick insights.
🎯 Goal: Build a simple Structured Streaming application in Apache Spark that reads streaming data from a folder, counts the number of orders per product, and displays the results.
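To try the project locally you need JSON files arriving in the input folder. The following is a minimal sketch of a helper that writes one batch of orders as a JSON Lines file (one JSON object per line, which is what Spark's JSON reader expects by default). Only the `product` field comes from this project; `order_id`, `quantity`, the helper name `write_sample_orders`, and the temporary folder are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_sample_orders(folder, batch_name, orders):
    """Write one batch of orders as a JSON Lines file and return its path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{batch_name}.json"
    # Write under a temporary name first, then rename, so the finished
    # file shows up in the folder in a single step.
    tmp = folder / f".{batch_name}.json.tmp"
    tmp.write_text("".join(json.dumps(order) + "\n" for order in orders))
    tmp.replace(target)
    return target


# Hypothetical sample data; only the "product" field is required by this project.
sample_orders = [
    {"order_id": "o-1", "product": "apple", "quantity": 2},
    {"order_id": "o-2", "product": "banana", "quantity": 1},
    {"order_id": "o-3", "product": "apple", "quantity": 5},
]

input_folder = tempfile.mkdtemp(prefix="orders_")
written = write_sample_orders(input_folder, "batch-001", sample_orders)
print(f"Wrote {len(sample_orders)} orders to {written}")
```

Calling the helper again with a new batch name while the streaming query is running simulates new orders arriving.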
📋 What You'll Learn
Create a streaming DataFrame reading JSON files from a folder
Define a query to count orders by product using Structured Streaming
Start the streaming query and display the output in the console
💡 Why This Matters
🌍 Real World
Companies use Structured Streaming to process live data like orders, sensor readings, or logs to get real-time insights.
💼 Career
Data engineers and data scientists use Structured Streaming to build pipelines that handle continuous data flows efficiently.
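Step 1: Create a streaming DataFrame from JSON files
Create a streaming DataFrame called orders_stream that reads JSON files from the folder "/path/to/orders" using spark.readStream and .json(). Streaming file sources require a schema up front, so supply one with .schema(...) before calling .json(), unless spark.sql.streaming.schemaInference is enabled.

Hint: Use spark.readStream.schema(schema).json(path) to read streaming JSON data.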
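One possible shape for this step is sketched below. By default, streaming file sources do not infer a schema, so the sketch passes one as a DDL string. Only the `product` column comes from this project; `order_id`, `quantity`, and the helper name `read_orders_stream` are assumptions for illustration.

```python
# Assumed schema: only `product` is named in this project,
# the other columns are hypothetical.
ORDERS_SCHEMA = "order_id STRING, product STRING, quantity INT"


def read_orders_stream(spark, path="/path/to/orders"):
    """Return a streaming DataFrame over JSON files that arrive in `path`.

    `spark` is an active SparkSession.
    """
    return (
        spark.readStream
        .schema(ORDERS_SCHEMA)  # schema given explicitly, as a DDL string
        .json(path)             # each new JSON file in the folder becomes new input rows
    )
```

With a session created through `SparkSession.builder.getOrCreate()`, the step's variable would be `orders_stream = read_orders_stream(spark)`, and `orders_stream.isStreaming` should then report `True`.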

Step 2: Define an aggregation query to count orders by product
Create a streaming aggregation DataFrame called orders_count that groups orders_stream by the product column and counts the orders in each group using .groupBy("product").count().

Hint: Call groupBy("product").count() on the streaming DataFrame.
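A minimal sketch of this step follows. The helper name `count_orders_by_product` is an assumption; the transformation itself is the one the step describes. Nothing is computed at this point: the call only defines the aggregation, and Spark keeps the running counts up to date once a query is started.

```python
def count_orders_by_product(orders_stream):
    """Define a running count of orders per product.

    `orders_stream` is a streaming DataFrame with a `product` column.
    The result has two columns: `product` and `count`.
    """
    return orders_stream.groupBy("product").count()
```

Applied to the DataFrame from the previous step, this gives `orders_count = count_orders_by_product(orders_stream)`.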

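Step 3: Start a streaming query that writes the counts to the console
Start a streaming query called query on orders_count that writes the output to the console using .writeStream.outputMode("complete").format("console").start(). The output mode must be set explicitly here: the default append mode is rejected for a streaming aggregation that has no watermark, so use "complete" (or "update").

Hint: Use .writeStream.outputMode("complete").format("console").start() to start the query.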
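The sketch below shows one way to start the query. It sets the output mode to `"complete"`, because Spark rejects the default append mode for an aggregation without a watermark; `"update"` would also be accepted and prints only the rows that changed in each batch. The helper name `start_console_query` is an assumption.

```python
def start_console_query(orders_count):
    """Start a streaming query that prints the running counts to the console."""
    return (
        orders_count.writeStream
        .outputMode("complete")  # re-emit the full result table after every micro-batch
        .format("console")       # console sink, intended for debugging and demos
        .start()                 # returns a StreamingQuery handle immediately
    )
```

Here `query = start_console_query(orders_count)` returns right away while the query keeps running in the background.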

Step 4: Print the query status and stop the query
Print the current status of the streaming query using print(query.status), then stop it with query.stop().

Hint: Use print(query.status) to see the query status and query.stop() to stop it.
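A minimal sketch of this last step is shown below. In PySpark, `status` is a property (not a method) that returns a dictionary describing what the query is currently doing. The helper name `report_and_stop` and the optional `wait_seconds` parameter are assumptions; the wait is included because a query stopped immediately after starting may not have processed a batch yet.

```python
def report_and_stop(query, wait_seconds=None):
    """Print the current status of a streaming query, then stop it."""
    if wait_seconds is not None:
        # Give the query some time to process data before inspecting it.
        query.awaitTermination(wait_seconds)
    status = query.status  # property returning a dict with the current status
    print(status)
    query.stop()
    return status
```

For example, `report_and_stop(query, wait_seconds=10)` lets the query run for up to ten seconds before printing its status and shutting it down.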