Apache Spark · Concept · Beginner · 4 min read

Structured Streaming in Spark with PySpark: What It Is and How It Works

Structured Streaming in Spark is a scalable and fault-tolerant stream processing engine built on Spark SQL. In PySpark, it allows you to process live data streams using the same APIs as batch data, making real-time analytics easier and consistent.
⚙️

How It Works

Structured Streaming works like a smart waiter who keeps checking for new orders (data) and serves them as soon as they arrive. Instead of processing all data at once, it processes data in small chunks called micro-batches or as continuous streams.

It treats streaming data as an unbounded table that keeps growing. You write queries on this table just like you do for static data, and Spark handles the rest—like reading new data, updating results, and recovering from failures automatically.

This approach makes streaming simple and reliable, as you can use familiar SQL-like operations and PySpark DataFrame APIs to analyze live data continuously.

💻

Example

This example shows how to read a stream of JSON files from a folder, count the number of records by a column, and write the results to the console in real time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

# Define the streaming source: JSON files arriving in the 'input_data' folder
input_stream = spark.readStream.schema("name STRING, age INT").json("input_data")

# Perform a simple aggregation: count records by age
age_counts = input_stream.groupBy("age").count()

# Write the output to the console in update mode
query = age_counts.writeStream.outputMode("update").format("console").start()

query.awaitTermination()
```
Output

-------------------------------------------
Batch: 0
-------------------------------------------
age | count
----|------
25  | 1

-------------------------------------------
Batch: 1
-------------------------------------------
age | count
----|------
25  | 2
30  | 1
🎯

When to Use

Use Structured Streaming when you need to process data as it arrives, such as monitoring sensor data, tracking user activity on websites, or analyzing financial transactions in real time.

It is ideal for applications that require quick insights from continuous data without waiting for batch jobs to finish. It also helps when you want to reuse the same code for both batch and streaming data, simplifying development and maintenance.

Key Points

  • Structured Streaming uses Spark SQL and DataFrame APIs for stream processing.
  • It processes data incrementally and updates results continuously.
  • Supports fault tolerance and end-to-end exactly-once guarantees (with replayable sources and idempotent sinks).
  • Allows the same code for batch and streaming data.
  • Outputs can be sent to various sinks like console, files, Kafka, or databases.

Key Takeaways

  • Structured Streaming in PySpark enables real-time data processing using familiar DataFrame APIs.
  • It treats streaming data as a continuously growing table and updates results incrementally.
  • Use it for applications needing immediate insights from live data like monitoring or analytics.
  • It provides fault tolerance and exactly-once guarantees for reliable stream processing.
  • You can write the same code for batch and streaming, simplifying your data pipelines.