Apache Spark | data | ~30 mins

Output modes (append, complete, update) in Apache Spark - Mini Project: Build & Apply

Understanding Output Modes in Apache Spark Structured Streaming
📖 Scenario: You work at a company that processes live sales data from multiple stores. You want to analyze this data in real time using Apache Spark Structured Streaming. The output mode of a streaming query controls which rows of the result are written to the sink each time new data arrives.
🎯 Goal: Learn how to use the three output modes append, complete, and update in Apache Spark Structured Streaming to control how streaming query results are output.
📋 What You'll Learn
Create a streaming DataFrame from a static DataFrame simulating sales data
Define a trigger interval for streaming
Use output modes: append, complete, and update
Print the streaming query output to the console
💡 Why This Matters
🌍 Real World
Companies use streaming data to monitor sales, website clicks, or sensor readings in real time so they can make quick decisions.
💼 Career
Understanding output modes in Spark Structured Streaming is essential for data engineers and data scientists working with real-time data pipelines.
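
To make the three modes concrete before the steps begin, here is a minimal plain-Python sketch (no Spark involved) that models what each mode would emit per micro-batch for a per-store running sum. The helper name emit_per_mode and the split of the sample rows into two micro-batches are assumptions for illustration. Append is modeled on the raw input rows, because Spark only allows append on an aggregation when a watermark is defined.

```python
from collections import defaultdict


def emit_per_mode(batches):
    """Model what each output mode emits after every micro-batch.

    Hypothetical helper: it mimics the semantics only, not Spark's engine.
    """
    totals = defaultdict(int)  # running result table: store -> sum(amount)
    emitted = {"append": [], "complete": [], "update": []}
    for batch in batches:
        changed = set()
        for store, amount in batch:
            totals[store] += amount
            changed.add(store)
        # append: only rows that are new in this batch (here: the raw input rows)
        emitted["append"].append(list(batch))
        # complete: the entire result table, every time
        emitted["complete"].append(dict(totals))
        # update: only result rows whose value changed in this batch
        emitted["update"].append({s: totals[s] for s in changed})
    return emitted


# Assumed arrival order: two rows in the first micro-batch, one in the second.
batches = [[("Store1", 100), ("Store2", 150)], [("Store1", 200)]]
result = emit_per_mode(batches)
print(result["complete"][1])  # {'Store1': 300, 'Store2': 150}
print(result["update"][1])    # {'Store1': 300}
```

In the second micro-batch, complete repeats the unchanged Store2 row, while update emits only the Store1 row whose sum changed.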
1
Create a static DataFrame simulating sales data
Create a Spark DataFrame called sales_data with these exact rows: ("Store1", 100), ("Store2", 150), ("Store1", 200). The columns should be store and amount.
Need a hint?

Use spark.createDataFrame with a list of tuples and specify the column names as a list.

2
Create a streaming DataFrame from the static DataFrame
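Create a streaming DataFrame called streaming_sales that simulates streaming data from sales_data. writeStream is only available on a streaming DataFrame, so the static sales_data cannot be started as a stream directly (Spark raises an AnalysisException), and start() returns a StreamingQuery rather than a DataFrame. Instead, write sales_data to a directory and read that directory back with spark.readStream.
Need a hint?

Use sales_data.write.json(path) to save the rows, then spark.readStream.schema(sales_data.schema).json(path) to create the streaming DataFrame. By default, file-based streaming sources require an explicit schema.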

3
Apply output modes to a streaming aggregation
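Create a streaming aggregation DataFrame called agg_sales by grouping streaming_sales by store and summing amount. Then write three streaming queries named query_append, query_complete, and query_update using output modes append, complete, and update respectively. Use format("console") and start() for each query. Spark rejects append mode for a streaming aggregation that has no watermark, and it rejects complete mode for a query that has no aggregation, so run query_append on the unaggregated streaming_sales and run query_complete and query_update on agg_sales.
Need a hint?

Use groupBy("store").sum("amount") to aggregate. Then use writeStream.outputMode(...).format("console").start() for each output mode.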

4
Print the streaming query output
Print the status of the three streaming queries query_append, query_complete, and query_update by using print(query_append.status), print(query_complete.status), and print(query_update.status).
Need a hint?

Use print(query_append.status) and similarly for the other queries to see their current status.